Superstore Sales Analytics: A KPI-Driven Visualization Approach¶

Unveiling Key Business Insights Through Data Visualization¶


A Data Analytics Project¶


Author: Amos Vante, Agricultural Economist & Data Analyst¶

Date: June 2025¶

Tools Used: Python, Pandas, Matplotlib, Seaborn¶

Dataset: Superstore (2024)¶


Project Overview¶

This project explores sales trends, profitability patterns, and regional performance within the Superstore dataset.
By leveraging data visualization techniques, we extract actionable insights to enhance business decision-making, optimize pricing strategies, and improve inventory management.


Table of Contents¶

  1. Section 1: Introduction
  2. Section 2: Data Wrangling
  3. Section 3: Data Analysis and Visualizations
  4. Section 4: Conclusion and Recommendations
  5. References

===============================================================================================================================

Section 1: Introduction¶

===============================================================================================================================

1.1. Problem Statement¶

In today’s competitive retail environment, understanding sales dynamics, customer behavior, and operational efficiencies is vital for sustained profitability. Retailers often struggle to identify trends, predict customer needs, and optimize their processes due to the vast amount of transactional data generated daily. This project seeks to address these challenges by leveraging data visualization to extract actionable insights from the Superstore dataset, empowering stakeholders to make informed decisions.

1.2. Importance of Data Visualization in Retail Analytics¶

Data visualization is a powerful tool for simplifying complex datasets, uncovering patterns, and driving actionable insights. In the retail sector, it enables:

  • Trend Identification: Recognizing sales peaks and operational inefficiencies.
  • Customer Understanding: Profiling high-value customers and their preferences.
  • Strategic Decision-Making: Providing stakeholders with clear insights for optimizing pricing, marketing, and supply chain operations.

By transforming raw data into intuitive visual representations, businesses can quickly adapt to market trends and improve their competitive edge.

1.3. Project Objectives¶

The primary goal of this project is to analyze the Superstore dataset to uncover insights into retail performance through comprehensive analysis and visualization. This will be achieved through the following core objectives:

  • 1. Business Metrics Overview: Identify and present the critical metrics used to assess performance.

  • 2. Overall Performance Analysis: Calculate and visualize key metrics over the entire period to provide a holistic view of business performance.

  • 3. Time-Series Insights: Examine key performance metrics over different time intervals (yearly, quarterly, and monthly) to identify trends and seasonal patterns.

  • 4. Segment Performance Analysis: Evaluate product, customer, and regional performance to identify key drivers of sales and profitability.

  • 5. Correlation Analysis: Examine relationships between key metrics, with a focus on the impact of discounts on profitability.

  • 6. Strategic Recommendations: Derive actionable recommendations to optimize business strategies and enhance overall profitability.

1.4. Scope and Limitations¶

This analysis focuses exclusively on the United States, with data related to Canada excluded. The study is based solely on the internal transactional data available in the Superstore dataset. While the findings provide valuable insights into retail performance, certain limitations should be noted:

  • Exclusion of External Factors: The analysis does not account for external influences such as market trends, competitor strategies, or customer sentiment, which may also impact retail performance.
  • Assumption of Data Completeness: It is assumed that all transactions in the dataset are accurately recorded and reflective of the business's actual operations. Any errors or omissions in the data could affect the findings.
  • Focus on Historical Performance: This project emphasizes analyzing past performance and does not include predictive modeling or forecasting of future trends.

Despite these limitations, the insights derived from this analysis will serve as a strong foundation for informed, data-driven decision-making in retail operations.

1.5. Expected Outcomes¶

By the end of the project, we aim to:

  • Highlight trends and seasonal patterns in the main metrics.
  • Identify key product categories, best-selling and most profitable products, key customer segments and their behaviors, and regional performance.
  • Uncover relationships between operational metrics.
  • Recommend strategies for boosting profitability and efficiency.

===============================================================================================================================

Section 2: Data Wrangling¶

Data wrangling (also called data munging) is the process of cleaning, transforming, and organizing raw data into a structured and usable format for analysis. It is an essential step in any data project, ensuring that the dataset is accurate, complete, and ready for meaningful insights.

Key Steps in Data Wrangling include:

1) Data Collection: Gathering data from sources such as CSV files, databases, APIs, or web scraping.

2) Initial Exploration: Inspecting the structure of the data, identifying data types, and detecting anomalies or inconsistencies.

3) Data Cleaning: Converting data types, handling missing values, removing duplicates or irrelevant columns or rows, correcting or standardizing inconsistent data entries, and handling outliers (if necessary).

4) Data Transformation: Creating new features or columns, normalizing or scaling numerical data, and encoding categorical variables into numerical formats.

5) Data Aggregation: Summarizing data at higher levels (e.g., total sales by region, average profit by category).
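To make step 5 concrete, here is a minimal pandas sketch of an aggregation; the frame and numbers are invented purely for illustration:

```python
import pandas as pd

# Hypothetical mini-dataset (illustration only, not the Superstore data)
sales = pd.DataFrame({
    "Region": ["West", "West", "East", "East"],
    "Sales":  [100.0, 50.0, 80.0, 20.0],
})

# Step 5 (aggregation): total sales by region
totals = sales.groupby("Region")["Sales"].sum()
print(totals)
# East    100.0
# West    150.0
```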

2.1. Data Gathering¶

2.1.1. Data description¶

In this project, we utilize the Superstore dataset, a well-known dataset available online in multiple formats. For this analysis, we work with an Excel file containing over 10,000 transactional records from a retail business operating across the United States and Canada. The dataset spans a four-year period from January 2021 to December 2024, offering detailed insights into various aspects of the business through four primary dimensions:

  • Products: Detailed categorization of items sold, including categories and subcategories.
  • Customers: Demographics such as regions, states, and customer segments (e.g., Consumer, Corporate).
  • Orders: Transaction-level details, including sales, profit, quantities, and discounts.
  • Shipping: Shipping methods and delivery times.

Each record corresponds to a single transaction, enabling analysis of sales performance, customer segmentation, and operational efficiency at both granular and macro levels.

2.1.2. Importing necessary packages and loading data¶

To conduct the project effectively, we will rely on a set of Python libraries that provide essential functionalities for various stages of the workflow, including data loading, exploration, cleaning, preprocessing, visualization, and analysis.

# a)  Importing key libraries :
import pandas as pd      # For data manipulation and analysis, particularly for handling data in DataFrame format.
import numpy as np       # For numerical operations
import matplotlib.pyplot as plt  # For creating static visualizations
import seaborn as sns  # For more advanced visualization (it is built on matplotlib)


# b) Data Loading (reading the Excel data file)
data = pd.read_excel('Sample - Superstore 2024.xlsx')

# c) Quick preview to verify loading success
print("Superstore dataset:")
print(data.head(3))
Superstore dataset:
   Row ID        Order ID Order Date  Ship Date       Ship Mode Customer ID  \
0       1  US-2021-103800 2021-01-03 2021-01-07  Standard Class    DP-13000   
1       2  US-2021-112326 2021-01-04 2021-01-08  Standard Class    PO-19195   
2       3  US-2021-112326 2021-01-04 2021-01-08  Standard Class    PO-19195   

   Customer Name      Segment Country/Region        City  ... Postal Code  \
0  Darren Powers     Consumer  United States     Houston  ...       77095   
1  Phillina Ober  Home Office  United States  Naperville  ...       60540   
2  Phillina Ober  Home Office  United States  Naperville  ...       60540   

    Region       Product ID         Category Sub-Category  \
0  Central  OFF-PA-10000174  Office Supplies        Paper   
1  Central  OFF-BI-10004094  Office Supplies      Binders   
2  Central  OFF-LA-10003223  Office Supplies       Labels   

                                        Product Name   Sales  Quantity  \
0  Message Book, Wirebound, Four 5 1/2" X 4" Form...  16.448         2   
1         GBC Standard Plastic Binding Systems Combs   3.540         2   
2                                          Avery 508  11.784         3   

   Discount  Profit  
0       0.2  5.5512  
1       0.8 -5.4870  
2       0.2  4.2717  

[3 rows x 21 columns]

2.1.3. Filtering the dataset¶

The dataset includes transaction records from both the United States and Canada, as mentioned earlier. However, to align with our analysis objectives, we will filter the dataset to focus exclusively on transactions from the United States.

# 2.1.3.1.Confirm that the dataset contains data for both the United States and Canada by checking the unique values in the Country/Region column.
print(data['Country/Region'].unique())
['United States' 'Canada']
# 2.1.3.2.Filter data to focus on transactions within the United States only
df = data[data['Country/Region'] == 'United States']

# Verify the filtering
print(df['Country/Region'].unique())
['United States']
# 2.1.3.3. Working with a copy of the filtered DataFrame
df = df.copy()

#------------------------------------------ Note-------------------------------------------------------------------------------
# Since the DataFrame df was created by filtering another DataFrame (data), pandas may treat df as a view of the original data.
# Modifications to df could then unintentionally affect the original data DataFrame or trigger a SettingWithCopyWarning.
# To prevent this and ensure that df is treated as a completely independent DataFrame, we work on an explicit copy of the filtered result.
# This approach eliminates the warning and avoids unintended side effects when modifying df.
# N.B: Another option would be to modify the column in question explicitly with .loc to avoid ambiguity, e.g.: df.loc[:, 'Discount'] = df['Discount'] * 100
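The copy-versus-view point can be demonstrated on a toy frame; the column values below are invented for illustration only:

```python
import pandas as pd

# Toy frame standing in for the full dataset (values invented for illustration)
data = pd.DataFrame({
    "Country/Region": ["United States", "Canada", "United States"],
    "Discount": [0.25, 0.1, 0.5],
})

# Filtering can yield a view; chaining .copy() makes df independent of data
df = data[data["Country/Region"] == "United States"].copy()

# Modifying the copy leaves the original untouched and raises no SettingWithCopyWarning
df["Discount"] = df["Discount"] * 100
print(data["Discount"].tolist())  # original unchanged: [0.25, 0.1, 0.5]
print(df["Discount"].tolist())    # [25.0, 50.0]
```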

2.2. Initial Exploration¶

The initial exploration involves assessing the dataset's structure, understanding its content, and reviewing data types and sample values.

Initial exploration refers to the preliminary step in the data analysis process where you familiarize yourself with the dataset by assessing its structure, understanding its content, and identifying potential issues. It involves performing basic checks and generating summary statistics to gain an initial understanding of the data's characteristics, which sets the foundation for subsequent cleaning, preprocessing, and analysis.

Key Checks :

  • Shape of the Dataset: To determine the number of rows (observations) and columns present. This helps gauge the dataset's size and complexity.

  • Columns and Data Types: To identify all columns and their respective data types (e.g., numerical, categorical, datetime) and ensure the data types are appropriate for analysis (e.g., dates in datetime format, numerical values as integer or float).

  • Data Preview: To view the first few rows of the dataset to get an idea of the structure and content.

  • Descriptive Statistics for Columns: To generate summary statistics for numerical columns (e.g., mean, median, standard deviation). For categorical columns, analyze unique values and their counts.

  • Missing values: Checking for missing or null values.

  • Duplicates: Identifying any duplicate rows in the dataset.

2.2.1. Shape of the dataset¶

# Original dataset shape
data_shape = data.shape
print("a) Shape of the original dataset:", data_shape)
print(f"Thus, the original dataset contains {data_shape[0]} rows and {data_shape[1]} columns.\n")

# Our filtered dataset shape (US data only)
df_shape = df.shape
print("b) Shape of our dataset containing US records only:", df_shape)
print(f"That means that our dataset contains {df_shape[0]} rows and {df_shape[1]} columns.\n")

# Canada's Record Count:
canada_count = data_shape[0] - df_shape[0]
print(f"c) Number of records for Canada: {canada_count}\n")

# Percentage of US records
us_proportion = (df_shape[0] / data_shape[0]) * 100
print(f"d) Percentage of US records in the original dataset: {us_proportion:.2f}%")
a) Shape of the original dataset: (10194, 21)
Thus, the original dataset contains 10194 rows and 21 columns.

b) Shape of our dataset containing US records only: (9994, 21)
That means that our dataset contains 9994 rows and 21 columns.

c) Number of records for Canada: 200

d) Percentage of US records in the original dataset: 98.04%

Observations¶

  • About 98% of the records pertain to the United States, indicating a significant focus on the US market compared to Canada.
  • This suggests that the dataset is predominantly composed of US-specific transactions.

2.2.2. Identification of the columns and their data types¶

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9994 entries, 0 to 10192
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Row ID          9994 non-null   int64         
 1   Order ID        9994 non-null   object        
 2   Order Date      9994 non-null   datetime64[ns]
 3   Ship Date       9994 non-null   datetime64[ns]
 4   Ship Mode       9994 non-null   object        
 5   Customer ID     9994 non-null   object        
 6   Customer Name   9994 non-null   object        
 7   Segment         9994 non-null   object        
 8   Country/Region  9994 non-null   object        
 9   City            9994 non-null   object        
 10  State/Province  9994 non-null   object        
 11  Postal Code     9994 non-null   object        
 12  Region          9994 non-null   object        
 13  Product ID      9994 non-null   object        
 14  Category        9994 non-null   object        
 15  Sub-Category    9994 non-null   object        
 16  Product Name    9994 non-null   object        
 17  Sales           9994 non-null   float64       
 18  Quantity        9994 non-null   int64         
 19  Discount        9994 non-null   float64       
 20  Profit          9994 non-null   float64       
dtypes: datetime64[ns](2), float64(3), int64(2), object(14)
memory usage: 1.7+ MB

Observations:¶

  • Our filtered dataset df contains 21 columns and 9994 rows.

  • The number of non-null entries in each column is 9994, which means there are no missing values in any column.

  • It may be beneficial to rename some (or all) columns to follow best practices and enhance readability by adopting a consistent naming convention in Python.

  • The data types of the columns appear correct. For example:

    • datetime: Order Date and Ship Date
    • int: Quantity
    • float: Sales, Profit, and Discount
    • Object (String): Variables like Region, Category, Postal Code, etc.

Note: The Object type is correct for the Postal Code column since it is typically used as an identifier and not for numerical operations. This ensures correct handling, such as preserving leading zeros (if any).
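The leading-zero point can be shown with a hypothetical example (the ZIP codes below are invented): numeric storage drops the first digit of a code like 06511, while string storage preserves it.

```python
import pandas as pd

# Hypothetical ZIP codes: 06511 has a leading zero
codes_as_int = pd.Series([6511, 77095])       # numeric storage: the zero is already lost
codes_as_str = pd.Series(["06511", "77095"])  # object/string storage: the zero survives

# Recovering a fixed-width string form from integers (US ZIP codes are 5 digits)
recovered = codes_as_int.astype(str).str.zfill(5)
print(recovered.tolist())  # ['06511', '77095']
```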

2.2.3. Double-Checking for missing values (Since we've already noted there are no missing values)¶

df.isna().sum()
Row ID            0
Order ID          0
Order Date        0
Ship Date         0
Ship Mode         0
Customer ID       0
Customer Name     0
Segment           0
Country/Region    0
City              0
State/Province    0
Postal Code       0
Region            0
Product ID        0
Category          0
Sub-Category      0
Product Name      0
Sales             0
Quantity          0
Discount          0
Profit            0
dtype: int64

2.2.4 Checking for duplicates¶

number_duplicates=df.duplicated().sum()

# Print the number of duplicated rows
print("Total Number of Duplicated Rows:", number_duplicates)
Total Number of Duplicated Rows: 0

2.2.5. Data Preview¶

# Randomly view 5 rows of the dataset (to get a diverse representation of the data).
df.sample(5)
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country/Region City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
4186 4187 US-2023-116365 2023-01-03 2023-01-08 Standard Class CA-12310 Christine Abelman Corporate United States San Antonio ... 78207 Central TEC-AC-10002217 Technology Accessories Imation Clip USB flash drive - 8 GB 30.080 2 0.2 -5.2640
415 416 US-2021-107811 2021-04-29 2021-05-03 Standard Class LA-16780 Laura Armstrong Corporate United States Memphis ... 38109 South FUR-CH-10001394 Furniture Chairs Global Leather Executive Chair 561.584 2 0.2 70.1980
8791 8792 US-2024-100111 2024-09-20 2024-09-26 Standard Class SV-20365 Seth Vernon Consumer United States New York City ... 10035 East FUR-CH-10001215 Furniture Chairs Global Troy Executive Leather Low-Back Tilter 2254.410 5 0.1 375.7350
8240 8241 US-2024-155740 2024-07-30 2024-08-02 First Class TC-21475 Tony Chapman Home Office United States Lakewood ... 44107 East OFF-BI-10001071 Office Supplies Binders GBC ProClick Punch Binding System 76.776 4 0.7 -53.7432
5282 5283 US-2023-111290 2023-07-22 2023-07-26 Standard Class DK-13375 Dennis Kane Consumer United States Westland ... 48185 Central TEC-AC-10004975 Technology Accessories Plantronics Audio 995 Wireless Stereo Headset 109.950 1 0.0 36.2835

5 rows × 21 columns

Observations:¶

This preview provides not only a quick snapshot of the dataset's structure and content but also allows us to verify that the data is recorded in a consistent manner and free of obvious errors.

2.2.6. Descriptive Statistics¶

Descriptive statistics provide a clear and concise summary of the essential characteristics of a dataset. This process enables a deeper understanding of the data's distribution, central tendency, and variability, offering insights into patterns, outliers, and potential quality issues. It serves as a cornerstone for guiding further analysis and determining necessary data cleaning, transformations, or corrections.

We will follow a structured three-step approach:

  1. Categorical (object) Columns: Analyze unique and top values, extract and display distinct values for better understanding, and identify inconsistencies such as misspellings and redundant categories.
  2. Numerical Columns: Compute key statistics (mean, median, standard deviation, etc.) to assess variability, outliers, and data quality.
  3. Visualizations: Use boxplots, histograms, and scatterplots to intuitively highlight outliers, trends, and relationships.

This approach ensures a comprehensive understanding of the data and its readiness for deeper analysis.

2.2.6.1. Examination of Categorical Columns¶

# 2.2.6.1.1. Describe the object columns
df.describe(include=['object'])
Order ID Ship Mode Customer ID Customer Name Segment Country/Region City State/Province Postal Code Region Product ID Category Sub-Category Product Name
count 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994
unique 5009 4 793 793 3 1 531 49 631 4 1862 3 17 1849
top US-2024-100111 Standard Class WB-21850 William Brown Consumer United States New York City California 10035 West OFF-PA-10001970 Office Supplies Binders Staple envelope
freq 14 5968 37 37 5191 9994 915 2001 263 3203 19 6026 1523 48

Main Insights:¶

1. Unique Values:

  • Orders : The dataset contains 5,009 distinct orders.
  • Customers : There are 793 unique customers, spanning 3 customer segments, showcasing a diverse customer base.
  • Products : A total of 1,862 distinct products were sold.
  • Categories and Sub-Categories : Products are grouped into 3 main categories and further divided into 17 sub-categories.
  • Regions and States : Customers are distributed across 4 regions and 49 states, demonstrating extensive geographical coverage.

2. Top and Frequency:

  • Most Frequent Segment : The Consumer segment is the most prevalent, appearing 5,191 times, suggesting that individual consumers account for a large share of sales.
  • Most Frequent Category : Office Supplies is the leading category, highlighting a strong focus on office-related products.
  • Most Common Sub-Category: Binders are the most frequent sub-category, indicating high demand in this area.
  • Top Product: The Staple Envelope is the most frequently sold product, reflecting consistent consumer preferences.
  • Most Frequent Region: The West region is the most represented, contributing the majority of the orders.
  • Most Common State: California is the most active state with the highest number of transactions.
  • Most Frequent Customer: William Brown appears in 37 transactions, indicating he could be a loyal customer or part of a large organization.
# 2.2.6.1.2. Extracting and Displaying Values of the object Columns

# a) Extract unique values 
ship_modes = df['Ship Mode'].unique()
segments = df['Segment'].unique()
categories = df['Category'].unique()
regions = df['Region'].unique()

# Display the results
print("\n1.Ship Modes:",','.join(ship_modes))
print("\n2.Segments:", ','.join(segments))
print("\n3.Categories:", ','.join(categories))
print("\n4.Regions:", ','.join(regions))
#----------------------------------------------------- Note-----------------------------------------------------
# Rather than displaying the output as a raw pandas array (e.g., Ship Modes = ['Standard Class', 'First Class', ...]),
# we prefer to format the output into a more readable string.
# For this reason, we use the separator.join(iterable) method to join the elements (an iterable) into a single string.
# Each element is separated by the specified delimiter — a comma in this case.
# This approach gives a cleaner and more readable output, like: Ship Modes: Standard Class,First Class,Second Class,Same Day
#------------------------------------------------------------------------------------------------------------------------------
# b) Group Subcategories by Category

grouped_subcat= df.groupby("Category")["Sub-Category"].unique() #  The output is a pandas series

# Iterate over the pandas Series
print("\n5.Subcategories by Category:")
print('='*50)
for category, subcategory in grouped_subcat.items():
    print(f"{category}: {subcategory}")
#----------------------------------------------------------------------------------------------------------------------------
# c) States by region

# Group states by region
grouped_reg = df.groupby("Region")["State/Province"].unique() 

# Iterate over the pandas Series
print("\n6.States by Region:")
print('='*50)
for region, states in grouped_reg.items():
    print(f"{region}: {states}")
    print()  # blank line between regions
    
#-------------------------------------------# Note----------------------------------------------------------------------
# Rather than directly printing the grouped_reg pandas Series, we want the output to be displayed as: Region: ['Texas', 'Illinois', ...]
# To achieve this, we iterate over the grouped_reg Series using .items().
# This method allows us to retrieve both the index (which represents the region) and the value (which contains the array of states).
1.Ship Modes: Standard Class,First Class,Second Class,Same Day

2.Segments: Consumer,Home Office,Corporate

3.Categories: Office Supplies,Furniture,Technology

4.Regions: Central,East,South,West

5.Subcategories by Category:
==================================================
Furniture: ['Chairs' 'Furnishings' 'Bookcases' 'Tables']
Office Supplies: ['Paper' 'Binders' 'Labels' 'Storage' 'Art' 'Fasteners' 'Envelopes'
 'Appliances' 'Supplies']
Technology: ['Phones' 'Accessories' 'Machines' 'Copiers']

6.States by Region:
==================================================
Central: ['Texas' 'Illinois' 'Michigan' 'Indiana' 'South Dakota' 'Wisconsin'
 'Missouri' 'Minnesota' 'Iowa' 'Oklahoma' 'Nebraska' 'Kansas'
 'North Dakota']

East: ['Pennsylvania' 'Delaware' 'Ohio' 'New York' 'New Jersey' 'Massachusetts'
 'Maryland' 'Connecticut' 'New Hampshire' 'Maine' 'Rhode Island'
 'District of Columbia' 'Vermont' 'West Virginia']

South: ['Kentucky' 'Georgia' 'Virginia' 'South Carolina' 'Louisiana' 'Arkansas'
 'Tennessee' 'Florida' 'North Carolina' 'Mississippi' 'Alabama']

West: ['California' 'Oregon' 'Arizona' 'Nevada' 'Washington' 'Colorado' 'Utah'
 'New Mexico' 'Idaho' 'Montana' 'Wyoming']

Observations:¶

The object columns were thoroughly evaluated for consistency, completeness, and accuracy:

  • Consistency: No duplicate or inconsistent labels (like typos, mixed formats, or mislabeled categories) were found in any column.
  • Completeness: All key categorical fields had no missing or null values.
  • Hierarchical Relationships: Sub-categories were verified for correct alignment with their parent categories, ensuring that no relationships were misrepresented. Similarly, states were accurately assigned to their respective regions, confirming that the hierarchical structure was intact.

Conclusion: These columns are clean, complete, and ready for analysis.
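Such hierarchical checks can also be automated; here is a hedged sketch on a toy frame (invented for illustration) that confirms each sub-category maps to exactly one parent category:

```python
import pandas as pd

# Toy frame (invented for illustration) mirroring the Category/Sub-Category hierarchy
toy = pd.DataFrame({
    "Category":     ["Furniture", "Furniture", "Technology", "Technology"],
    "Sub-Category": ["Chairs", "Tables", "Phones", "Copiers"],
})

# Each sub-category should belong to exactly one parent category
parents_per_subcat = toy.groupby("Sub-Category")["Category"].nunique()
print(bool((parents_per_subcat == 1).all()))  # True when the hierarchy is intact
```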

2.2.6.2. Examination of numerical columns¶

# Describe the numerical columns ('Row ID' is excluded since it is an identifier, not a meaningful measure)
df_numeric = df.select_dtypes(include=['number']).drop(columns=['Row ID'])

df_numeric.describe().round(2) 
Sales Quantity Discount Profit
count 9994.00 9994.00 9994.00 9994.00
mean 229.86 3.79 0.16 28.66
std 623.25 2.23 0.21 234.26
min 0.44 1.00 0.00 -6599.98
25% 17.28 2.00 0.00 1.73
50% 54.49 3.00 0.20 8.67
75% 209.94 5.00 0.20 29.36
max 22638.48 14.00 0.80 8399.98

Main Insights and Potential Issues Identified :¶

1. Total Quantity:

  • Data Completeness: The dataset contains 9,994 quantity records with no missing values, ensuring reliability for analysis, a characteristic shared by all variables in the dataset.

  • Order Size Range: Order sizes vary from 1 to 14 items per order. While the upper limit suggests bulk purchases or potential outliers, further investigation may be required.

  • Distribution Characteristics :

    • The average order size is 3.79 items, while the median is 3 items, indicating a slightly right-skewed distribution. Most orders are small, as reflected by the median value.
    • A standard deviation of 2.23 shows moderate variability in order sizes, with most orders falling within a consistent range.
    • The interquartile range (IQR) reveals that the middle 50% of orders are between 2 and 5 items, highlighting the dominance of smaller orders.
    • Percentiles: The 75th percentile is 5 items, meaning three-quarters of orders contain no more than 5 items. However, the maximum of 14 items suggests outliers in the upper quartile.

2. Total Sales:

  • Range of Sales : Sales values span from USD 0.44 (smallest transaction) to USD 22,638.48 (largest transaction), indicating substantial variability. The upper end may include outliers.

  • Distribution Characteristics:

    • The average sales value per transaction is USD 229.86, while the median is USD 54.49, reflecting a right-skewed distribution. Most transactions are low-value, but a few high-value sales significantly impact the average.
    • A standard deviation of USD 623.25 indicates high variability in transaction values.
    • The IQR (USD 17.28 to USD 209.94) represents typical sales activity.
    • Skewed Distribution: A few high-value transactions disproportionately affect revenue, suggesting the need for segmentation and analysis of significant sales drivers.

3. Total Profit:

  • Profit Range : Profits range from -USD 6,599.98 (significant losses) to USD 8,399.98 (substantial gains).

    • Losses may result from steep discounts, high costs, or operational inefficiencies, requiring closer examination.
    • High profits may stem from high-margin products or bulk sales, offering opportunities to maximize profitability.

  • Distribution Characteristics :

    • The average profit is USD 28.66, while the median is USD 8.67, suggesting a right-skewed distribution, likely influenced by outliers.
    • A standard deviation of USD 234.26 reflects a wide spread of profitability.
    • The IQR (USD 1.73 to USD 29.36) indicates that most profits are modest.
    • Outliers: The gap between the maximum profit (USD 8,399.98) and the 75th percentile (USD 29.36) suggests the presence of upper-end outliers.

4. Discount:

  • Range of Discounts: Discounts range from 0% to 80%. High discounts may indicate clearance sales, promotions, or price adjustments aimed at stimulating demand.

  • Impact on Profitability :

    • While discounts can increase sales volume, excessive discounts may lead to losses, especially for low-margin products.
    • Understanding the relationship between discounts and profitability is critical for optimizing pricing strategies that strike a balance between volume growth and margin preservation.
  • Distribution Characteristics :

    • The average discount is 16%, and the median is 20%, indicating that discounts are generally moderate.
    • The 75th percentile value of 20% shows that most discounts are below or equal to this level.
    • The maximum discount of 80% represents a significant deviation, possibly indicating outliers.

Note: While descriptive measures offer an overview of potential data issues, further analysis using visual tools is essential. See the following sections for more insights.
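As a preview of how the discount-profit relationship could be probed with pandas, here is a minimal sketch on invented numbers (the actual analysis of the Superstore data comes in Section 3):

```python
import pandas as pd

# Synthetic transactions (invented for illustration): profit tends to fall as the discount rises
tx = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.8],
    "Profit":   [36.0, 10.0, 5.0, 4.0, -5.5],
})

# Average profit per discount level
avg_profit = tx.groupby("Discount")["Profit"].mean()
print(avg_profit)  # 0.0 -> 23.0, 0.2 -> 4.5, 0.8 -> -5.5
```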

2.2.6.3. Determination of the number of outliers and Visualizations¶

#---------------------------------------------------Note: Outliers----------------------------------------------------

# Outliers are extreme values that differ significantly from other data points in a dataset. 
# They can distort statistical analyses, so it is important to detect and handle them accordingly (if necessary). 
# For example, outliers may:
    # - skew the mean and Standard deviation
    # - affect correlation and regression models 
    # - Violate assumptions (normality or homogeneity of variance in the data). 
    # - distort visualizations 
# While they can obscure patterns, outliers can also highlight interesting or unusual observations worthy of further investigation.

# There are two commonly used methods for outlier detection:
         # Z-score Method:
# This method detects outliers by measuring the number of standard deviations a value lies from the mean: z = (x − μ) / σ.
# Values are typically considered outliers if their Z-score is greater than 3 or less than -3 (a commonly used threshold),
# which can also be expressed as: |Z| > 3.
# Drawback: Since it relies on the mean and standard deviation, it can be less reliable for skewed distributions or data with heavy tails.
# So, it is best suited for normally distributed data.


         # IQR Method:
# This method detects outliers based on the interquartile range (IQR), the difference between the 75th (Q3) and 25th (Q1) percentiles.
# Outliers are values falling outside the range [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
# Advantage compared to the Z-score: since it is based on percentiles, it is less influenced by extreme values, making it more robust for skewed data.
# So, it is more effective for skewed or non-normal distributions (which could otherwise distort the analysis).

                       # N.B:
# Given the skewness in our data, we'll prioritize results from the IQR method
# as it is more robust to extreme values compared to the Z-score approach.
# 2.2.6.3.1. Determining the number of outliers (both methods)

# 1) Selection of the numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.drop('Row ID')

# 2) Initialize an empty list to store outlier counts for each column
# The list will hold tuples of the form: (column_name, Z-score_outlier_count, IQR_outlier_count)
outlier_counts= []

# 3) Define the threshold for the Z-score method
threshold = 3
# N.B: threshold  is a global constant that is defined outside the loop, 
# just to avoid re-assigning the same constant value repeatedly inside the loop, improving both efficiency and clarity.

# 4) Iterate over each numerical column to calculate both Z-score and IQR-based outliers
for col in numerical_cols:
    # Outliers using Z-score method
    z_scores = (df[col] - df[col].mean()) / df[col].std()  # Calculate Z-scores
    z_outliers = np.abs(z_scores) > threshold  # Boolean masks to identify outliers (True when outlier) 
    num_z_outliers = z_outliers.sum()  # Count outliers 
    
    # Outliers using the IQR method
    Q1 = df[col].quantile(0.25)  # calculating the first quartile
    Q3 = df[col].quantile(0.75)  # calculating the third quartile
    IQR = Q3 - Q1  
    lower_bound = Q1 - 1.5 * IQR  # Defining the lower bound for outliers
    upper_bound = Q3 + 1.5 * IQR  # Defining the upper bound for outliers
    iqr_outliers = (df[col] < lower_bound) | (df[col] > upper_bound)  # Identify outliers
    num_iqr_outliers = iqr_outliers.sum()  # Count outliers 
    
    # Append the results to the outlier_counts list
    outlier_counts.append((col, num_z_outliers, num_iqr_outliers)) # Tuple structure:(column_name, Z-score_outlier_count, IQR_outlier_count)
    
# 5) Display the count of outliers identified by both methods.
                                      # Note
# If we directly print(outlier_counts), the results would appear as a list of tuples, e.g.:
# [('Sales', 127, 1167), ('Quantity', 113, 170), ('Discount', 300, 856), ('Profit', 107, 1881)]
# While this is accurate, it is not very readable. To provide a cleaner and more user-friendly output, 
# we will iterate over the outlier_counts list (which is an iterable) and use Python's string formatting capabilities
# particularly alignment and width options to format the output into a well-structured table, as shown below:

# Iterate over outlier_counts and format the output:
print("\nOutlier counts from Z-score and IQR method:")
print(f"\n{'Column':<15} {'Z-Score Outliers':<20} {'IQR Outliers':<15}")
print("=" * 50)
for column, z_outliers, iqr_outliers in outlier_counts:
    print(f"{column:<15} {z_outliers:<20} {iqr_outliers:<15}")
Outlier counts from Z-score and IQR method:

Column          Z-Score Outliers     IQR Outliers   
==================================================
Sales           127                  1167           
Quantity        113                  170            
Discount        300                  856            
Profit          107                  1881           

Observations:¶

  • The Z-score method identifies fewer outliers than the IQR method for all columns. This is expected because:

    • The Z-score method assumes a normal distribution of the data, making it less sensitive to extreme deviations in highly skewed or non-normally distributed data.
    • The IQR method is more robust and detects more outliers, especially in datasets with skewed distributions or long tails.
  • There is a large gap in outlier counts between the Z-score and IQR methods for the variables Profit, Sales, and Discount. This suggests that these variables have highly skewed or long-tailed distributions, where many data points lie far outside the interquartile range but remain within 3 standard deviations of the mean.

  • For the Quantity variable, the gap between the two methods is much smaller, which indicates that this variable is likely less skewed or closer to a normal distribution than the others, making both methods more consistent in their detection of outliers.


#-------------------------------------------- Note: Histogram and Boxplot -----------------------------------------------------------------------
                                             
                                              # Histogram
# A histogram is a graphical representation of the distribution of a dataset. 
# It displays the frequency (or count) of data points within specific intervals, called bins, along the range of a variable.
# This makes it a fundamental tool for understanding the shape, spread, and central tendency of the data, as well as for detecting patterns, skewness, and potential outliers.
# A histogram is the most direct way to examine the distribution of data and identify skewness.
# An important consideration when plotting a histogram is determining the optimal number of bins.
# The choice of bins can significantly impact the visualization:
      # - Too few bins may oversimplify the data, obscuring important details.
      # - Too many bins may introduce noise, making patterns harder to identify.
# In the literature, several methods have been proposed to determine the optimal number of bins. 
# Some common ones include: Sturges' Rule, the Square Root Rule, Scott's Rule, the Freedman-Diaconis Rule, Doane's Formula, etc.
# Many of these methods are implemented in Matplotlib and Seaborn.
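As a quick standalone illustration (not part of the project's pipeline), NumPy exposes several of these rules by name through `numpy.histogram_bin_edges`, so the bin counts they produce on the same skewed sample can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=3, sigma=1, size=1000)  # a right-skewed toy sample

# Each rule yields a different number of bins for the same data
for rule in ["sturges", "sqrt", "scott", "fd"]:
    edges = np.histogram_bin_edges(data, bins=rule)
    print(f"{rule:>8}: {len(edges) - 1} bins")
```

On skewed data the Freedman-Diaconis rule typically produces far more bins than Sturges' rule, since it scales bin width by the IQR rather than assuming normality.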

                                                    # Boxplot

# Box plots provide a concise visualization of a variable's distribution, highlighting its spread, central tendency, and outliers.

# Key Components:
 #- Box: Represents the interquartile range (IQR), which contains the middle 50% of the data.
 #- Top of the Box : Third quartile (75th percentile).
 #- Bottom of the Box: First quartile (25th percentile).
 #- Median Line: A line within the box indicates the median (50th percentile) of the dataset.
 #- Whiskers: Extend to the smallest and largest data points within 1.5 times the IQR from the box.
 #- Lower Whisker: Minimum value within 1.5 * IQR below the first quartile.
 #- Upper Whisker: Maximum value within 1.5 * IQR above the third quartile.
 #- Outliers: Data points beyond the whiskers are plotted as individual dots, highlighting unusually high or low values.

# N.B:  We'll plot histograms  and boxplots for each column to visually inspect the distribution
# and confirm the presence of outliers.
# 2.2.6.3.2. Visualization: Histogram and Boxplot

# Set the Seaborn theme first, so that the 'whitegrid' style applies to the subplots created below
sns.set_theme(style='whitegrid', rc={"axes.titlesize": 16, "axes.titleweight": "bold", "axes.labelsize": 14, "axes.labelweight": "bold",
        "legend.fontsize": 11, "xtick.labelsize": 12, "ytick.labelsize": 12})

# Create a 4x2 grid of subplots for histograms and box plots
fig, axes = plt.subplots(4, 2, figsize=(18, 18))

# Here, we use the sns.set_theme() function to:
#   - Set the style: The default 'darkgrid' is changed to 'whitegrid' for a cleaner appearance  
#     with subtle grid lines, enhancing readability while maintaining structure.  
#   - Adjust key elements globally: This includes title sizes, font weights, legend font size,  
#     and tick label sizes, ensuring a consistent and professional look across all plots in the script.  
#   - Avoid redundant styling: By setting these configurations globally, we eliminate the need  
#     to specify them individually for each plot, making the code cleaner and more efficient.
#----------------------------------------------------------------------------------------------------------------------
# 1. Visualization for Quantity

# Histogram 
sns.histplot(data=df, x='Quantity', color='skyblue', ax=axes[0, 0], bins='fd', stat='percent') 
axes[0, 0].set_title('Histogram of Quantity')  
axes[0, 0].set_ylabel('Percent')

# N.B: Freedman-Diaconis Rule is used as the variables are skewed. 
# We use stat='percent' for discrete variables and stat='count' for purely continuous variables.

# Boxplot
sns.boxplot(data=df, x='Quantity', color='skyblue', ax=axes[0, 1],fliersize=5)
axes[0, 1].set_title('Boxplot of Quantity')  
#------------------------------------------------------------------------------------------------------------------------------
# 2. Visualization for Sales

# Histogram 
sns.histplot(data=df, x='Sales', color='skyblue', ax=axes[1, 0], kde=True, bins='fd')
axes[1, 0].set_title('Histogram of Sales')
axes[1, 0].set_ylabel('Count')

# Boxplot 
sns.boxplot(data=df, x='Sales', color='skyblue', ax=axes[1, 1],fliersize=5)
axes[1, 1].set_title('Boxplot of Sales')  
#------------------------------------------------------------------------------------------------------------------------------
# 3.Visualization for Profit
 
# Histogram  
sns.histplot(data=df, x='Profit', color='skyblue', ax=axes[2, 0], kde=True, bins='fd')
axes[2, 0].set_title('Histogram of Profit')  
axes[2, 0].set_ylabel('Count')

# Boxplot
sns.boxplot(data=df, x='Profit', color='skyblue', ax=axes[2, 1],fliersize=5)
axes[2, 1].set_title('Boxplot of Profit')  
axes[2, 1].set_xlabel('Profit')
#-------------------------------------------------------------------------------------------------------------------------------
# 4.Visualization for Discount

# Histogram 
sns.histplot(data=df, x='Discount', color='skyblue', ax=axes[3, 0], bins='fd', stat='percent')
axes[3, 0].set_title('Histogram of Discount') 
axes[3, 0].set_xlabel('Discount')
axes[3, 0].set_ylabel('Percent')

# Boxplot
sns.boxplot(data=df, x='Discount', color='skyblue', ax=axes[3, 1],fliersize=5)
axes[3, 1].set_title('Boxplot of Discount')  
axes[3, 1].set_xlabel('Discount')
#---------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flatten():
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.yaxis.grid(True, linestyle='--', alpha=0.7)  #  Enabling the grid for the y-axis only.
#--------------------------------------------------------------------------------------------------------------------------- 
# Adjust Layout for Better Spacing
plt.tight_layout(pad=3)  

# Display the Plots
plt.show()
# ------------------------------------ Note: axes.flatten() ------------------------------------
# In Matplotlib, when creating a grid of subplots using plt.subplots(), the returned 'axes' object  
# is typically a 2D NumPy array of Axes instances.  
# Instead of using nested loops to iterate over all subplots, we can flatten this 2D array into  
# a 1D array using the flatten() method. This simplifies the process by allowing iteration in a single loop, making the code cleaner.
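To make the note concrete, here is a minimal self-contained sketch (using an illustrative 2x3 grid, not the 4x2 grid above) showing the shapes involved:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this illustration
import matplotlib.pyplot as plt

# plt.subplots with nrows > 1 and ncols > 1 returns a 2D NumPy array of Axes
fig, axes = plt.subplots(2, 3)
print(axes.shape)            # 2D: (2, 3)
print(axes.flatten().shape)  # 1D: (6,)

# One flat loop instead of nested row/column loops
for ax in axes.flatten():
    ax.set_xticks([])
```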
[Figure: Histograms and boxplots of Quantity, Sales, Profit, and Discount]

Main Insights¶

The plots align closely with the descriptive analysis of the variables and confirm several key points:

1. Total Quantity Sold:

  • The histogram confirms a right-skewed distribution: smaller quantities are far more common. More than 80% of sales involve quantities between 1 and 5 items, with 48% of transactions centered on 2 and 3 items. As the quantity increases beyond 3 items, the percentage of transactions drops sharply, highlighting a preference for smaller, more frequent purchases.

  • The boxplot highlights potential outliers, with order sizes beyond 9 exceeding the whiskers. The maximum observed value is 14 items, making it an extreme case.

2. Total Sales:

  • Sales distribution is heavily right-skewed, with a large concentration of smaller sales amounts and a long tail of a few high-value transactions. This aligns with the median (USD 54.49) being much lower than the mean (USD 229.86), which is influenced by extreme values.

  • Significant outliers are visible, with sales exceeding USD 500. Extreme values, like the maximum sale of USD 22,638.48, are isolated instances and likely contribute disproportionately to overall revenue. The clustering of lower sales within the IQR range (USD 17.28–USD 209.94) is consistent with the analysis.

3. Total Profit:

  • The histogram reveals a cluster around 0, with transactions distributed across both positive and negative values. This indicates that a substantial number of transactions generate minimal or no profit.

  • The boxplot shows a high density near the mean, with outliers on both ends for substantial losses and gains. The minimum profit (USD -6,599.98) and the maximum profit (USD 8,399.98) are extreme cases visible in the boxplot and histogram.

4. Discount:

  • The histogram shows that the most frequent discount levels are 0% (i.e., no discount, involving 48% of transactions) and 20% (accounting for 37% of transactions). Other discount levels (such as 10%, 15%, 30%, and 40%) are far less common. This aligns with the mean (16%) and the IQR (0%–20%).

  • Discounts above 50% are rare and classified as outliers, with the maximum discount (80%) being an extreme case. The typical discounts are concentrated within the IQR, confirming moderate discounting practices.

In summary, the plots validate the descriptive statistics. They emphasize the importance of examining the outliers more closely, handling skewed distributions where necessary, using appropriate metrics (e.g., medians) to represent central tendencies, and investigating the relationship between discount and profitability to uncover actionable insights.

2.2.7. Data cleaning¶

Data cleaning refers specifically to the process of identifying and correcting or removing errors, inconsistencies, or inaccuracies in the data. The goal is to improve the quality and reliability of the data.

2.2.7.1. Missing values, duplicates, data types and consistency¶

Based on the exploration phase, the dataset exhibits a high level of cleanliness. Below are the key observations:

  • No missing values: All columns are complete with no null or missing entries.
  • No duplicates: The dataset does not contain any duplicate rows.
  • Correct data types: All columns have appropriate data types, making the dataset ready for visualization and analysis.
  • Consistency: Data is recorded in a consistent manner, free from apparent errors.

2.2.7.2. Handling Outliers and Skewness¶

During the exploration phase, it was observed that all the numerical variables (Quantity, Sales, Profit, and Discount) exhibit a significant number of outliers and varying levels and types of skewness. These affect the quality of visualizations (as seen earlier) and can potentially distort certain types of analysis.

1) Outliers¶

    

Outliers can be addressed using various strategies, depending on the data's context and the analysis goals.

Common approaches include:

  • Removing Outliers: Excluding extreme values entirely to simplify the dataset and reduce noise.
  • Leaving Outliers Intact: Retaining them to preserve the dataset's full range of variability.
  • Applying Mathematical Transformations: Using techniques like logarithmic or square root transformations to reduce the impact of outliers.
  • Replacing Outliers: Substituting outliers with more representative values, such as the mean or median.
  • Separate Outlier Analysis: Treating outliers as a distinct group for independent examination, especially if they represent meaningful anomalies or rare cases.
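As a brief illustration of the "Replacing Outliers" option (not applied in this project), here is a minimal sketch using a hypothetical helper `cap_iqr` that clips values to the same IQR fences used in the detection step above:

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 100])  # 100 is an obvious outlier
print(cap_iqr(s).tolist())  # the extreme value is pulled back to the upper fence
```

This "capping" (a winsorizing-style replacement) keeps every row in the dataset while limiting the influence of extremes, unlike outright removal.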

2) Skewed Data¶

    

To improve the quality of visualizations (e.g., histograms, boxplots) and ensure more reliable statistical analysis, it is often beneficial to transform skewed data into a more symmetrical form.

Below are some common transformation methods:

  • Logarithmic Transformation
  • Square Root Transformation
  • Cube Root Transformation
  • Reciprocal Transformation
  • Exponential Transformation
  • Box-Cox Transformation
  • Quantile Transformation
  • Log Modulus Transformation
  • Yeo-Johnson Transformation

Key Considerations :

  • No single transformation method is universally ideal. Each has its strengths and weaknesses.
  • The choice depends on the data context, analysis objectives, and the nature and degree of the skewness.
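As a small illustration of why such transformations help (a sketch on synthetic data, not part of this project's pipeline, since skewness is retained here), a logarithmic transformation makes a strongly right-skewed sample far more symmetric:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0, sigma=1, size=5000))  # strongly right-skewed

print(f"skewness before log: {s.skew():.2f}")          # large positive value
print(f"skewness after log:  {np.log(s).skew():.2f}")  # close to 0
```

Here the log of a lognormal sample is (by construction) normally distributed, so its skewness is near zero; real data will rarely be this clean, but the direction of the effect is the same.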

Decision for this project:

In this project, outliers and skewness are intentionally retained to emphasize comprehensive exploration (rather than statistical analysis and modeling) and a deeper understanding of overall trends.

This approach allows for a more holistic view of the dataset, ensuring that its natural variability is preserved. While removing outliers might simplify the dataset and reduce noise, it could also lead to the loss of insights derived from atypical but potentially important cases. Therefore, the dataset is preserved in its entirety to ensure a more thorough and nuanced analysis.

2.2.7.3. Removing irrelevant columns¶

df.drop(columns='Row ID',inplace=True) 

# Since it is just an identifier and thus not essential for analysis

2.2.7.4. Renaming columns¶

# In this analysis, we'll use the snake_case convention, where all letters are lowercase and words are separated by underscores (e.g., order_date, customer_id).
# Other conventions, such as PascalCase (e.g., CustomerId) or camelCase (e.g., customerId), can also be used depending on the project’s standards or requirements.

df.rename(columns={
    'Order ID': 'order_id',
    'Order Date': 'order_date',
    'Ship Date': 'ship_date',
    'Ship Mode': 'ship_mode',
    'Customer ID': 'customer_id',
    'Customer Name': 'customer_name',
    'Postal Code': 'postal_code',
    'Product ID': 'product_id',
    'Sub-Category': 'sub_category',
    'Product Name': 'product_name',
    'Segment':'segment',
    'Region':'region',
    'Country/Region': 'country',               
    'City': 'city',               
    'State/Province': 'state',
    'Category':'category',
    'Sales': 'sales',      
    'Quantity': 'quantity',         
    'Discount':'discount',   
    'Profit':'profit'     
}, inplace=True)

print(df.columns)

# N.B: The other column names (Category, Region, Sales, Quantity, ...) are also converted to lowercase for standardization purposes.

#--------------------------------------------------------- Note----------------------------------------------------------------------------------------
# In this specific context, we could also resort to basic string manipulation techniques to rename the columns more quickly:
# df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')
# df.columns
# We first convert all column names to lowercase and then replace spaces and hyphens with underscores.
Index(['order_id', 'order_date', 'ship_date', 'ship_mode', 'customer_id',
       'customer_name', 'segment', 'country', 'city', 'state', 'postal_code',
       'region', 'product_id', 'category', 'sub_category', 'product_name',
       'sales', 'quantity', 'discount', 'profit'],
      dtype='object')

2.2.8. Data Transformation¶

Data transformation is the process of converting data from its original format or structure into a new format that is more suitable for analysis, modeling, or further processing. This transformation involves a variety of operations, such as scaling, encoding, aggregation, normalization, feature extraction, with the goal of improving data quality, enhancing visualizations, and meeting the assumptions of statistical models.

Examples of transformation include: Min-max scaling, z-score standardization, one-hot encoding, logarithmic transformation,...
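For instance, the first two of these can be expressed in a couple of lines (a generic sketch with toy values, not applied to the Superstore data here):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

min_max = (s - s.min()) / (s.max() - s.min())  # rescales values into [0, 1]
z_score = (s - s.mean()) / s.std()             # centers at 0 with unit (sample) std

print(min_max.tolist())
print(z_score.round(2).tolist())
```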

# 2.2.8.1. Changing the discount column into percentage to make it more meaningful.

# This is an example of scaling transformation. Through this, the discount column has a more easily interpretable format.
df['discount'] = df['discount'] * 100
# 2.2.8.2. Extracting Date components (year, quarter, month, and day) from the order_date column

# This is a feature engineering transformation that enhances the dataset by creating new time-based feature columns
# derived from the existing order_date column.

# To do that, we use the "dt" accessor (while ensuring that order_date is of the correct datetime type). 
df['year'] = df['order_date'].dt.year
df['quarter'] = df['order_date'].dt.quarter
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day

A=df[['order_date','year','quarter','month','day']].head(5)
print(A,'\n')
print(A.dtypes) # Checking for data types
  order_date  year  quarter  month  day
0 2021-01-03  2021        1      1    3
1 2021-01-04  2021        1      1    4
2 2021-01-04  2021        1      1    4
3 2021-01-04  2021        1      1    4
4 2021-01-05  2021        1      1    5 

order_date    datetime64[ns]
year                   int32
quarter                int32
month                  int32
day                    int32
dtype: object

Observations:¶

We can observe that the extracted components (year, month, or day) are stored as integers. While this format works well for computational tasks, it can lead to visualization issues when using libraries like Seaborn or Matplotlib.

This is because these libraries frequently interpret integers as continuous variables, whereas components such as year or month are intended to represent discrete categories. This misinterpretation may cause:

  • Improper plot layouts: Charts might display unintended scaling or interpolation.
  • Axis labeling issues: Numeric axes can appear cluttered or incorrectly spaced, reducing readability.

To address this, it is often necessary to convert these components into strings or other appropriate data types, ensuring they are treated as categorical variables during visualization.

See the Visualization section.
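The conversion itself is a one-liner; a minimal sketch with hypothetical year values:

```python
import pandas as pd

# Hypothetical year values to illustrate the dtype change
years = pd.Series([2021, 2022, 2023, 2024])

print(years.dtype)              # integer dtype: plotting libraries treat it as continuous
print(years.astype(str).dtype)  # object dtype: treated as discrete categories instead
```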

# 2.2.8.3. Creating year_quarter and year_month columns using pd.PeriodIndex.from_fields()
 
# The year_quarter and year_month columns are also derived features. 
# Such a feature engineering transformation enhances the dataset's utility.

df['year_quarter'] = pd.PeriodIndex.from_fields(year=df['year'], quarter=df['quarter'], freq='Q')
df['year_month'] = pd.PeriodIndex.from_fields(year=df['year'], month=df['month'], freq='M')

print(df[['year_quarter','year_month']])
      year_quarter year_month
0           2021Q1    2021-01
1           2021Q1    2021-01
2           2021Q1    2021-01
3           2021Q1    2021-01
4           2021Q1    2021-01
...            ...        ...
10188       2024Q4    2024-12
10189       2024Q4    2024-12
10190       2024Q4    2024-12
10191       2024Q4    2024-12
10192       2024Q4    2024-12

[9994 rows x 2 columns]

Observations:¶

We can note that the year_quarter and year_month columns already appear in the correct chronological order. So we do not need to sort.
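This chronological ordering is a property of the Period dtype itself: Periods compare by time, not by their string representation. A small sketch with made-up quarters:

```python
import pandas as pd

# Deliberately out of order to show that sorting is chronological
pi = pd.PeriodIndex(["2024Q4", "2021Q1", "2023Q2"], freq="Q")

print(pi.sort_values())   # chronological order
print(pi.min(), pi.max()) # earliest and latest periods
```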

2.2.9. Data Aggregation¶

Data aggregation is the process of summarizing detailed data into higher-level insights by grouping and calculating metrics. It is a crucial step for analyzing trends, identifying patterns, and generating actionable insights.

In this project, aggregation will enable us to explore sales, profit, and other key metrics across various dimensions such as time, region, and product categories.

Since the main focus of this project is on visualization, and the plots will rely heavily on aggregated data, we will perform the data aggregation directly in the Visualization section below. This approach ensures that the data is prepared in the format best suited for insightful visual representation.


===============================================================================================================================

Section 3: Data Analysis and Visualizations

This section presents the detailed analysis and visualizations that support the objectives outlined in Section 1.

Each part focuses on different aspects of retail performance, providing insights that will guide strategic decision-making.

3.1. Overview of Key Metrics¶

Objective: Identify and present the critical metrics retained in this project to assess the performance of the retail business.

In this analysis, several key metrics (or key performance indicators) are used to gain a comprehensive understanding of the retail performance based on the Superstore dataset. These metrics help to evaluate sales, profitability, customer behavior, and order trends. Below is a description of the selected metrics, and their relevance to the analysis:

1. Total Quantity Sold:

  • Definition: The total number of products sold across all orders.
  • Relevance: This metric helps understand overall demand for products and is crucial for inventory management and supply chain optimization.

2. Total Sales:

  • Definition: The total revenue generated from all sales transactions.
  • Relevance: Sales are a fundamental metric for assessing revenue generation and the overall performance of the business.

3. Total Profit:

  • Definition: The net earnings after deducting costs from revenue.
  • Relevance: Profit is the key indicator of the company's financial success and efficiency in turning revenue into earnings.

4. Number of Customers:

  • Definition: The total number of unique customers who made a purchase.
  • Relevance: The number of customers helps evaluate the customer base and the effectiveness of customer acquisition strategies.

5. Number of Orders:

  • Definition: The total number of unique orders placed.
  • Relevance: This metric reflects the volume of transactions and helps in analyzing purchasing behavior and order frequency.

6. Average Discount:

  • Definition: The average discount percentage applied across all transactions.
  • Relevance: This helps assess how often discounts are applied and the impact of discounting on sales and profitability.

7. Profit Margin:

  • Definition: The percentage of sales that represents profit.
  • Relevance: Profit margin measures how efficiently the business is converting sales into profit, which is essential for assessing profitability.

8. Average Revenue per Customer:

  • Definition: The average amount of revenue generated from each customer.
  • Relevance: This metric is useful for understanding customer spending behavior and evaluating customer value.

9. Average Revenue per Order:

  • Definition: The average amount of revenue generated from each order.
  • Relevance: This helps assess the average transaction size and can indicate trends in customer spending behavior.

10. Average Profit per Customer:

  • Definition: The average profit generated per customer.
  • Relevance: Understanding average profit per customer allows businesses to measure customer profitability and identify high-value customers.

11. Average Number of Items per Order:

  • Definition: The average number of items purchased in each order.
  • Relevance: This metric reveals consumer purchasing patterns, helping to evaluate order sizes and inventory requirements.

12. Average Number of Orders per Customer:

  • Definition: The average number of orders placed by each customer.
  • Relevance: This metric helps in understanding customer loyalty and repeat purchase behavior.

3.2. Overall Metrics Performance Overview¶

Objective: Calculate and visualize the key metrics over the entire period to provide a holistic view of business performance.

# Calculate and Visualize the metrics:

# Step 1: Compute KPIs:

# a) Direct Metrics 
total_quantity_sold = df['quantity'].sum()
total_sales = df['sales'].sum()
total_profit= df['profit'].sum()
number_of_customers = df['customer_id'].nunique()  
number_of_orders = df['order_id'].nunique()       
average_discount= df['discount'].mean()

# b) Derived Metrics
profit_margin = (total_profit / total_sales)* 100
average_revenue_per_customer = (total_sales / number_of_customers)
average_revenue_per_order = (total_sales / number_of_orders)
average_profit_per_customer = (total_profit / number_of_customers)
average_number_items_per_order = (total_quantity_sold / number_of_orders)
average_number_orders_per_customer = (number_of_orders / number_of_customers)
#-------------------------------------------------------------------------------------------------------------------------------

# Step 2: Visualize KPIs Using Card Visuals

# Create a 3x4 grid for the KPI cards
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(25, 10))
fig.patch.set_facecolor('lightgrey')

# Add a title for the figure
fig.suptitle('Key Performance Indicators (KPIs)', fontsize=26, fontweight='bold', color='black')

# Define a list of KPIs with their respective labels, formats, and colors.
kpi = [
    (total_quantity_sold, 'Total Quantity Sold', axes[0, 0], "{:,.0f}", 'teal'),
    (total_sales, 'Total Sales', axes[0, 1], "${:,.2f}", 'blue'),
    (total_profit, 'Total Profit', axes[0, 2], "${:,.2f}", 'green'),
    (profit_margin, 'Profit Margin', axes[0, 3], "{:.2f}%", 'purple'),
    (average_discount, 'Average Discount', axes[1, 0], "{:.2f}%", 'brown'),
    (number_of_customers, 'Total Customers', axes[1, 1], "{:,.0f}", 'red'),
    (number_of_orders, 'Total Orders', axes[1, 2], "{:,.0f}", 'purple'),
    (average_number_orders_per_customer, 'Avg Orders/Customer', axes[1, 3], "{:.2f}", 'orange'),
    (average_revenue_per_customer, 'Avg Revenue/Customer', axes[2, 0], "${:,.2f}", 'navy'),
    (average_revenue_per_order, 'Avg Revenue/Order', axes[2, 1], "${:,.2f}", 'darkred'),
    (average_profit_per_customer, 'Avg Profit/Customer', axes[2, 2], "${:,.2f}",'violet'),
    (average_number_items_per_order, 'Avg Items/Order', axes[2, 3], "{:.2f}", 'lime')    
]
# Loop through KPIs and create a card for each in the grid
for value, label, ax, fmt, color in kpi:
    ax.text(x=0.5, y=0.5, s=f'{label}\n{fmt.format(value)}', horizontalalignment='center', verticalalignment='center',fontsize=22, 
            fontweight='bold', color=color,bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=1'))
    ax.axis('off')  # Hide axis for the card
    
# N.B: To center the text within each KPI card,we use  x=0.5, y=0.5, horizontalalignment='center' and verticalalignment='center'. 
#------------------------------------------------------------------------------------------------------------------------------ 
# Adjust layout for spacing
plt.tight_layout()

# Display the plot
plt.show()
#---------------------------------------------------------------Note-----------------------------------------------------------
# We define a list of metrics (KPI values) and loop through it to avoid repetitive calls to ax.text() for each metric.
# This approach keeps the code concise and avoids redundancy. Alternatively, we could define each metric separately 
# and call ax.text() individually, but this would make the code longer and less maintainable.
[Figure: KPI cards for the twelve key performance indicators]

3.3. Time Series Visualizations¶

Objective: Examine key performance metrics over different time intervals (yearly, quarterly, and monthly) to identify trends and seasonal patterns.

To achieve this analysis effectively, we will follow a structured three-step process:

1. Aggregate the Data: The first step is to group the data by the desired time periods (yearly, quarterly, or monthly) and aggregate the key performance metrics. This allows us to assess performance over time and capture trends at different intervals.

2. Calculate Descriptive Statistics: Once the data is aggregated, we will calculate descriptive statistics (mean, standard deviation, min, max, etc.) for each time period. This step helps in understanding the distribution, volatility, and central tendencies within the data.

3. Visualize the Data: Finally, we will visualize the aggregated and summarized data using appropriate charts (e.g., line plots, bar charts, heatmaps). Visualizations will help us clearly see trends, identify seasonality, and better interpret the data.

3.3.1. Yearly Analysis¶

Objective: Examine trends and fluctuations at a yearly level to understand their impact on sales, profitability, and overall business performance over time.

# a) Data Aggregation

# Aggregate data by Year
yearly_aggr = df.groupby('year', as_index=False).agg({
    'quantity': 'sum',
    'sales': 'sum',
    'profit': 'sum',
    'discount': 'mean',
    'order_id': 'nunique',
    'customer_id': 'nunique'
}).round(2)

# Rename the columns
yearly_aggr = yearly_aggr.rename(columns={
    'quantity': 'total_quantity_sold',
    'sales': 'total_sales',
    'profit': 'total_profit',
    'discount': 'average_discount',
    'order_id': 'number_of_orders',
    'customer_id': 'number_of_customers'
})

# Calculate new metrics and add new columns to the yearly_aggr dataframe
yearly_aggr['profit_margin'] = ((yearly_aggr['total_profit'] / yearly_aggr['total_sales'])* 100)
yearly_aggr['average_revenue_per_customer'] = (yearly_aggr['total_sales'] / yearly_aggr['number_of_customers'])
yearly_aggr['average_revenue_per_order'] = (yearly_aggr['total_sales'] / yearly_aggr['number_of_orders'])
yearly_aggr['average_profit_per_customer'] = (yearly_aggr['total_profit'] / yearly_aggr['number_of_customers'])
yearly_aggr['average_profit_per_order'] = (yearly_aggr['total_profit'] / yearly_aggr['number_of_orders'])
yearly_aggr['average_number_orders_per_customer'] = (yearly_aggr['number_of_orders'] / yearly_aggr['number_of_customers'])


# Convert 'year' from int to string (to ensure it doesn't appear as an integer in the aggregated results and plots)
yearly_aggr['year'] = yearly_aggr['year'].astype(str)

# Display the aggregation results
yearly_aggr.round(2).head()
year total_quantity_sold total_sales total_profit average_discount number_of_orders number_of_customers profit_margin average_revenue_per_customer average_revenue_per_order average_profit_per_customer average_profit_per_order average_number_orders_per_customer
0 2021 7581 484247.50 49543.97 15.83 969 595 10.23 813.86 499.74 83.27 51.13 1.63
1 2022 7979 470532.51 61618.60 15.56 1038 573 13.10 821.17 453.31 107.54 59.36 1.81
2 2023 9837 609205.60 81795.17 15.47 1315 638 13.43 954.87 463.27 128.21 62.20 2.06
3 2024 12476 733215.26 93439.27 15.65 1687 693 12.74 1058.03 434.63 134.83 55.39 2.43
# b) Descriptive statistics of the aggregated data.
yearly_stats=yearly_aggr.describe()
yearly_stats.round(2)
total_quantity_sold total_sales total_profit average_discount number_of_orders number_of_customers profit_margin average_revenue_per_customer average_revenue_per_order average_profit_per_customer average_profit_per_order average_number_orders_per_customer
count 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00
mean 9468.25 574300.22 71599.25 15.63 1252.25 624.75 12.37 911.98 462.74 113.46 57.02 1.98
std 2233.24 122949.32 19723.02 0.15 326.13 52.90 1.46 116.97 27.38 23.24 4.82 0.35
min 7581.00 470532.51 49543.97 15.47 969.00 573.00 10.23 813.86 434.63 83.27 51.13 1.63
25% 7879.50 480818.75 58599.94 15.54 1020.75 589.50 12.12 819.35 448.64 101.47 54.32 1.77
50% 8908.00 546726.55 71706.88 15.60 1176.50 616.50 12.92 888.02 458.29 117.87 57.38 1.94
75% 10496.75 640208.02 84706.20 15.70 1408.00 651.75 13.18 980.66 472.39 129.86 60.07 2.15
max 12476.00 733215.26 93439.27 15.83 1687.00 693.00 13.43 1058.03 499.74 134.83 62.20 2.43

Some Insights:¶

Total Quantity Sold:

  • The yearly total quantity ranged from 7,581 to 12,476 units, with an average of 9,468.25 and a median of 8,908.00 units sold annually. A standard deviation of 2,233.24 indicates moderate variability across years, reflecting fluctuating demand over time.

Total Sales:

  • The yearly sales ranged from USD 470,532.51 to USD 733,215.26, with an average of USD 574,300.22 and a median of USD 546,726.55. The standard deviation of USD 122,949.32 indicates high variability: sales performance varies significantly year to year, possibly due to external factors such as seasonality, promotions, or market trends.

Total Profit:

  • The yearly profit ranged from USD 49,543.97 to USD 93,439.27, with an average of USD 71,599.25 and a median of USD 71,706.88.
  • The standard deviation of USD 19,723.02 indicates moderate variability in yearly profits.

Profit Margin:

  • Yearly profit margins ranged between 10.23% and 13.43%, with a mean of 12.37% and a standard deviation of 1.46%. Margins are relatively consistent but could be further optimized by reducing costs or increasing high-margin product sales.

Average Discount:

  • The yearly average discount ranged from 15.47% to 15.83%, with an average of 15.63%.
  • A standard deviation of 0.15% indicates consistent discount rates across years, reflecting a stable pricing strategy.

Number of Customers:

  • The number of unique customers per year ranged from 573 to 693, with an average of 625 customers annually.
  • The standard deviation of 53 customers suggests moderate variation in the customer base year over year. Customer acquisition appears stable, but efforts to expand the customer base could significantly boost revenue and profit.
# Visualizing Key Metric Trends over the Years

# Set Seaborn theme 
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16, "axes.titleweight": "bold", "axes.labelsize": 14, "axes.labelweight": "bold",
        "legend.fontsize": 11, "xtick.labelsize": 12, "ytick.labelsize": 12})

# Create a 3x2 grid of subplots 
fig, axes = plt.subplots(3, 2, figsize=(20, 15))

# Set background color for better contrast
fig.patch.set_facecolor('lightgrey')

# -----------------------------------------------------------------------------------------------------------------------------
## 1. Total Quantity Sold, Number of Customers and Orders 

# Bar plot for total quantity (primary y-axis)
sns.barplot(x='year', y='total_quantity_sold', data=yearly_aggr, color='steelblue', alpha=0.6, ax=axes[0, 0], label='Quantity Sold')

# Create a secondary y-axis for number of customers and orders
ax1_secondary = axes[0, 0].twinx()

# Overlay line plots for number of customers and number of orders
sns.lineplot(x='year', y='number_of_customers', data=yearly_aggr, color='orange', marker='o', ax=ax1_secondary, label='Customers')
sns.lineplot(x='year', y='number_of_orders', data=yearly_aggr, color='teal', marker='x', ax=ax1_secondary, label='Orders')

# Customize plot
axes[0, 0].set_title('Total Quantity Sold, Customers, and Orders by Year')
axes[0, 0].set_ylabel('Quantity Sold (Units)')
ax1_secondary.set_ylabel('Counts')
ax1_secondary.grid(False)  # Hide secondary y-axis grid
axes[0, 0].legend(loc='upper left')
ax1_secondary.legend(loc='upper left', bbox_to_anchor=(0, 0.9))

# -----------------------------------------------------------------------------------------------------------------------------
## 2. Total Sales and Profit with Profit Margin 

# Transform `yearly_aggr` into a long-format DataFrame for bar plotting
melted_df = yearly_aggr.melt(id_vars='year', value_vars=['total_sales', 'total_profit'], var_name='Metric', value_name='Value')

# Rename metric names for better readability
melted_df['Metric'] = melted_df['Metric'].replace({'total_sales': 'Sales', 'total_profit': 'Profit'})

# Bar plot for Total Sales and Profit (Primary Y-Axis)
sns.barplot(x='year', y='Value', hue='Metric', data=melted_df, palette=['blue', 'orange'], ax=axes[0, 1], alpha=0.85)

# Create a secondary y-axis for Profit Margin
ax2_secondary = axes[0, 1].twinx()

# Line plot for Profit Margin (Secondary Y-Axis)
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, color='red', marker='^', linestyle='--', ax=ax2_secondary, label='Profit Margin')

# Customize plot
axes[0, 1].set_title('Total Sales and Profit Margin by Year')
axes[0, 1].set_ylabel('Amount (USD)')
ax2_secondary.set_ylabel('Profit Margin (%)')
ax2_secondary.grid(False)
axes[0, 1].legend(loc='upper left')
ax2_secondary.legend(loc='upper left', bbox_to_anchor=(0, 0.85))

# -----------------------------------------------------------------------------------------------------------------------------
## 3. Average Revenue and Profit per Customer

# Line plots for revenue and profit per customer over time
sns.lineplot(x='year', y='average_revenue_per_customer', data=yearly_aggr, color='blue', marker='o', ax=axes[1, 0], label='Revenue/Customer')
sns.lineplot(x='year', y='average_profit_per_customer', data=yearly_aggr, color='green', marker='s', linestyle='--', ax=axes[1, 0], label='Profit/Customer')

# Customize plot
axes[1, 0].set_title('Average Revenue and Profit per Customer by Year')
axes[1, 0].set_ylabel('Amount (USD)')
axes[1, 0].legend(loc='upper left')

# -----------------------------------------------------------------------------------------------------------------------------
## 4. Average Revenue, Profit per Order and Number of Orders

# Line plots for revenue per order and profit per order
sns.lineplot(x='year', y='average_revenue_per_order', data=yearly_aggr, color='darkslateblue', marker='D', ax=axes[1, 1], label='Revenue/Order')
sns.lineplot(x='year', y='average_profit_per_order', data=yearly_aggr, color='darkolivegreen', marker='h', linestyle='--', ax=axes[1, 1], label='Profit/Order')

# Customize plot
axes[1, 1].set_title('Average Revenue and Profit per Order by Year')
axes[1, 1].set_ylabel('Amount (USD)')
axes[1, 1].legend(loc='upper right')

# -----------------------------------------------------------------------------------------------------------------------------
## 5. Average Number of Orders per Customer

# Bar plot to show the number of orders per customer
sns.barplot(x='year', y='average_number_orders_per_customer', data=yearly_aggr, color='limegreen', alpha=0.7, ax=axes[2, 0])

# Customize plot
axes[2, 0].set_title('Number of Orders per Customer by Year')
axes[2, 0].set_ylabel('Orders/Customer')

# -----------------------------------------------------------------------------------------------------------------------------
## 6. Profit Margin and Average Discount

# Line plots for Profit Margin and Average Discount
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, color='purple', marker='o', ax=axes[2, 1], label='Profit Margin')
sns.lineplot(x='year', y='average_discount', data=yearly_aggr, color='darkblue', marker='s', ax=axes[2, 1], label='Avg Discount')

# Customize plot
axes[2, 1].set_title('Profit Margin vs. Average Discount by Year')
axes[2, 1].set_ylabel('Percentage (%)')
axes[2, 1].legend(loc='best')

# -----------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.grid(True, linestyle=':', alpha=0.7)  # Add a light dotted grid for readability
    ax.tick_params(axis='x', rotation=45)  # Rotate x-axis labels for better readability
    ax.set_xlabel('')  # Remove x-axis labels

# -----------------------------------------------------------------------------------------------------------------------------
# Add Data Labels to Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax == axes[2, 0] else '%.0f'  # Use decimal format only for the orders-per-customer plot
    for container in ax.containers:  
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='edge')

# -----------------------------------------------------------------------------------------------------------------------------
# Adjust Layout for Better Spacing
plt.tight_layout(pad=3)  

# Display the Plots
plt.show()
(Output: 3×2 grid of yearly key-metric trend plots.)

Key Metrics Over Years: Some Insights¶

1. Total Quantity Sold, Customer Base, and Order Volume:

  • Quantity Sold: The total quantity sold increased significantly and consistently over time, rising from 7,581 units in 2021 to 12,476 units in 2024—a remarkable growth of 64.57% over the period. This sustained growth reflects rising demand.

  • Order Volume: The total number of orders increased steadily, rising from 969 in 2021 to 1,687 in 2024, marking a significant 74.10% growth over the period.

  • Customer Base: The number of customers followed an upward trend, reaching its highest level of 693 customers in 2024, despite a slight dip in 2022. This growth reflects the store’s success in customer acquisition and retention.

The expansion of the customer base and order volume directly contributed to the growth in quantity sold, underscoring the importance of engaging new customers while maintaining repeat purchases.

2. Total Sales, Profit, and Profit Margin:

  • Total Sales: Sales demonstrated a clear upward trend, increasing from USD 484,247.50 in 2021 to USD 733,215.26 in 2024, a 51.41% increase despite a dip in 2022. This growth is likely driven by higher order volumes (as observed above) and possibly enhanced by effective promotional strategies.
  • Total Profit: Profit increased in a significant and consistent manner, from USD 49,544 in 2021 to USD 93,439 in 2024, reflecting an 88.6% growth. Notably, despite a decline in sales in 2022, profit still increased by 24.4% during that year. This growth is likely attributed to a reduction in discounts (from 15.83% to 15.56%) or a stronger focus on high-margin products.
  • Profit Margin: The profit margin increased steadily until 2023, when it peaked at 13.43%, before declining to 12.74% in 2024—a year when both sales and profit reached their highest levels. These trends demonstrate the store’s capability to drive volume growth while maintaining profitability, likely through effective pricing strategies and cost management.

3. Average Revenue and Profit per Customer:

  • Revenue per Customer: The average revenue per customer steadily increased, reaching USD 1,058.03 in 2024, indicating higher spending per customer over time.

  • Profit per Customer: Profit per customer also rose consistently, hitting USD 134.83 in 2024, driven by growing customer value.

    This trend reflects the store’s success in increasing customer lifetime value (CLV) by encouraging higher spending and delivering value to loyal customers.

4. Average Revenue and Profit per Order:

  • Revenue per Order: Unlike revenue per customer, the revenue per order shows a declining trend over time, decreasing from USD 499.74 in 2021 to USD 463.27 in 2023, and further to USD 434.63 in 2024. This suggests that individual transaction sizes have decreased, potentially due to promotions or bundling strategies aimed at increasing order frequency.
  • Profit per Order: Despite the declining trend in revenue per order, profit per order has shown an upward trajectory, reflecting improved efficiency and profitability. However, there was a slight decline in 2024 compared to 2023, which may signal a need to evaluate recent cost and pricing strategies.

5. Number of Orders per Customer:

  • The number of orders per customer increased steadily over the years, rising from 1.6 in 2021 to 2.4 in 2024. This rise in order frequency reflects stronger customer relationships and suggests growing satisfaction and trust in the store’s offerings.

6. Profit Margin and Discount:

  • Discounts: Discounts ranged from 15.47% to 15.83% and showed a slight downward trend across the years, despite a slight increase in 2024.
  • As mentioned earlier, profit margins increased steadily until peaking in 2023, before declining in 2024. It appears there is a negative relationship between discount levels and profit margin. In fact, the highest profit margins correspond to the lowest discount levels and vice versa.

This highlights the critical role of discount optimization in enhancing profitability while maintaining strong sales performance.
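The year-over-year growth figures quoted above can be computed directly with `pct_change()` rather than by hand. A sketch using the total-sales column from the aggregation table above (in the notebook itself this would operate on `yearly_aggr`):

```python
import pandas as pd

# Yearly sales totals as reported in the aggregation table above
yearly = pd.DataFrame({
    'year': [2021, 2022, 2023, 2024],
    'total_sales': [484247.50, 470532.51, 609205.60, 733215.26],
})

# Year-over-year growth in percent (first row is NaN by construction)
yearly['sales_growth_pct'] = yearly['total_sales'].pct_change() * 100

# Cumulative growth from 2021 to 2024
overall = (yearly['total_sales'].iloc[-1] / yearly['total_sales'].iloc[0] - 1) * 100
```

The same one-liner applied to `total_profit` or `number_of_orders` reproduces the other growth rates cited in the insights.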

3.3.2. Quarterly Analysis¶

Objective: Examine trends, fluctuations, and seasonal patterns at a quarterly level to understand their impact on sales, profitability, and overall business performance over time.

# a) Data Aggregation

# aggregate data by year_quarter
quarterly_aggr = df.groupby(['year_quarter'], as_index=False).agg({
    'quantity': 'sum',
    'sales': 'sum',
    'profit': 'sum',
    'discount': 'mean',
    'order_id': 'nunique',
    'customer_id': 'nunique'
}).round(2)

# Rename the columns for clarity
quarterly_aggr = quarterly_aggr.rename(columns={
    'quantity': 'total_quantity_sold',
    'sales': 'total_sales',
    'profit': 'total_profit',
    'discount': 'average_discount',
    'order_id': 'number_of_orders',
    'customer_id': 'number_of_customers'
})

# Add new metrics and new columns to the quarterly_aggr dataframe
quarterly_aggr['profit_margin'] = ((quarterly_aggr['total_profit'] / quarterly_aggr['total_sales'])* 100)
quarterly_aggr['average_revenue_per_customer'] = (quarterly_aggr['total_sales'] / quarterly_aggr['number_of_customers'])
quarterly_aggr['average_revenue_per_order'] = (quarterly_aggr['total_sales'] / quarterly_aggr['number_of_orders'])
quarterly_aggr['average_profit_per_customer'] = (quarterly_aggr['total_profit'] / quarterly_aggr['number_of_customers'])
quarterly_aggr['average_profit_per_order'] = (quarterly_aggr['total_profit'] / quarterly_aggr['number_of_orders']).round(2)
quarterly_aggr['average_number_orders_per_customer'] = (quarterly_aggr['number_of_orders'] / quarterly_aggr['number_of_customers'])

# Convert 'year_quarter' column from period to string (to ensure compatibility with plotting),
# since the period data type is not well supported by Seaborn/Matplotlib.
quarterly_aggr['year_quarter'] = quarterly_aggr['year_quarter'].astype(str)

# Display the aggregation results
quarterly_aggr.round(2).head()
year_quarter total_quantity_sold total_sales total_profit average_discount number_of_orders number_of_customers profit_margin average_revenue_per_customer average_revenue_per_order average_profit_per_customer average_profit_per_order average_number_orders_per_customer
0 2021Q1 1028 74447.80 3811.23 15.74 131 121 5.12 615.27 568.30 31.50 29.09 1.08
1 2021Q2 1523 86538.76 11204.07 14.55 201 180 12.95 480.77 430.54 62.24 55.74 1.12
2 2021Q3 2159 143633.21 12804.72 15.51 267 228 8.91 629.97 537.95 56.16 47.96 1.17
3 2021Q4 2871 179627.73 21723.95 16.76 370 305 12.09 588.94 485.48 71.23 58.71 1.21
4 2022Q1 990 68851.74 9264.94 14.60 144 135 13.46 510.01 478.14 68.63 64.34 1.07
# b) Descriptive statistics of the aggregated data.
quarterly_stats=quarterly_aggr.describe()
quarterly_stats.round(2)
total_quantity_sold total_sales total_profit average_discount number_of_orders number_of_customers profit_margin average_revenue_per_customer average_revenue_per_order average_profit_per_customer average_profit_per_order average_number_orders_per_customer
count 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00 16.00
mean 2367.06 143575.05 17899.81 15.59 313.06 252.25 12.35 561.64 467.06 69.14 57.46 1.21
std 1019.58 59229.73 8525.62 0.88 134.94 85.84 3.04 66.25 55.53 19.26 16.20 0.11
min 990.00 68851.74 3811.23 14.46 131.00 121.00 5.12 464.19 364.48 31.50 29.09 1.07
25% 1583.75 92208.93 12003.53 15.01 210.75 189.00 11.44 502.70 436.18 60.72 46.83 1.11
50% 2240.50 134923.34 16106.97 15.38 284.00 236.50 12.53 569.59 469.13 68.46 57.19 1.19
75% 2939.25 180295.05 23358.39 16.08 385.00 306.25 13.51 595.53 509.90 73.05 62.11 1.28
max 4696.00 280054.07 38139.86 17.17 632.00 436.00 19.09 684.34 568.30 110.55 97.94 1.45

Some Insights¶

Total Profit: The lowest quarterly profit was USD 3,811.23 and the highest was USD 38,139.86, with an average of USD 17,899.81 and a standard deviation of USD 8,525.62.

Profit Margin: Profit margins varied from 5.12% to 19.09%, with a mean of 12.35% and a standard deviation of 3.04%.

Number of Customers: The number of unique customers per quarter ranged from 121 to 436, with a mean of 252 and a standard deviation of 86.
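Beyond the overall descriptive statistics, quarter-of-year seasonality can be checked by grouping the quarters on their `Q1`–`Q4` label. A sketch using the first five rows of the aggregated table shown above (the full notebook would apply the same grouping to all 16 rows of `quarterly_aggr`):

```python
import pandas as pd

# First five quarters from the aggregated table shown above
q = pd.DataFrame({
    'year_quarter': ['2021Q1', '2021Q2', '2021Q3', '2021Q4', '2022Q1'],
    'total_sales': [74447.80, 86538.76, 143633.21, 179627.73, 68851.74],
})

# Extract the quarter-of-year label ('Q1'...'Q4') from strings like '2021Q1'
q['quarter'] = q['year_quarter'].str[-2:]

# Average sales within each quarter-of-year, across years
seasonal = q.groupby('quarter')['total_sales'].mean()
```

On the full frame, a markedly higher Q4 mean than Q1 mean would quantify the seasonal pattern discussed in the visualizations that follow.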

# c) Visualizing Key Metric Trends over the Quarters

# Set the seaborn style 
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a figure with 3x2 subplots
fig, axes = plt.subplots(3, 2, figsize=(20, 15))

# Set figure background color
fig.patch.set_facecolor('lightgrey') 
#-----------------------------------------------------------------------------------------------------------------------------------
## 1. Total Quantity, Number of Customers, and Orders

# Create bar plot for Total Quantity
sns.barplot(x='year_quarter', y='total_quantity_sold', data=quarterly_aggr, color='steelblue', alpha=0.6, ax=axes[0, 0],label='Quantity Sold')

# create line plots for Number of Customers and Orders
ax1_secondary = axes[0, 0].twinx()
sns.lineplot(x='year_quarter', y='number_of_customers', data=quarterly_aggr, color='orange', marker='o', ax=ax1_secondary,label='Customers')
sns.lineplot(x='year_quarter', y='number_of_orders', data=quarterly_aggr, color='teal', marker='x', ax=ax1_secondary,label='Orders')

# Customizations
axes[0, 0].set_title('Total Quantity Sold, Customers, and Orders by Quarter')
axes[0, 0].set_ylabel('Quantity Sold (Units)')
ax1_secondary.set_ylabel('Counts')
ax1_secondary.grid(False)  # Turn off secondary grid
axes[0, 0].legend(loc='upper left')
ax1_secondary.legend(loc='upper left',bbox_to_anchor=(0, 0.9))
#---------------------------------------------------------------------------------------------------------------------------
# 2. Total Sales and Profit with Profit Margin

# Create side-by-side bar plots for Total Sales and Profit
melted_df = quarterly_aggr.melt(id_vars='year_quarter', value_vars=['total_sales', 'total_profit'],var_name='Metric', value_name='Value')
melted_df['Metric'] = melted_df['Metric'].replace({'total_sales': 'Sales', 'total_profit': 'Profit'})
sns.barplot(x='year_quarter', y='Value', hue='Metric', data=melted_df, palette=['blue', 'orange'], ax=axes[0, 1], alpha=0.85)

# Create a line plot for Profit Margin on the secondary axis
ax2_secondary = axes[0, 1].twinx()
sns.lineplot(x='year_quarter', y='profit_margin', data=quarterly_aggr, color='red', marker='^', linestyle='--', ax=ax2_secondary,label='Profit Margin')

# Customizations
axes[0, 1].set_title('Total Sales and Profit Margin by Quarter')
axes[0, 1].set_ylabel('Amount (USD)')
ax2_secondary.set_ylabel('Profit Margin (%)')
ax2_secondary.grid(False)
axes[0, 1].legend()
ax2_secondary.legend(loc='upper left',bbox_to_anchor=(0, 0.85))
#---------------------------------------------------------------------------------------------------------------------------------
# 3. Average Revenue and Profit per Customer
sns.lineplot(x='year_quarter', y='average_revenue_per_customer', data=quarterly_aggr, color='blue', marker='o', ax=axes[1, 0], label='Revenue/Customer')
sns.lineplot(x='year_quarter', y='average_profit_per_customer', data=quarterly_aggr, color='green', marker='s', linestyle='--', ax=axes[1, 0], label='Profit/Customer')
axes[1, 0].set_title('Average Revenue and Profit per Customer by Quarter')
axes[1, 0].set_ylabel('Amount (USD)')
axes[1, 0].legend(loc='upper right')
#-------------------------------------------------------------------------------------------------------------------------------
# 4. Average Revenue and Profit per Order
sns.lineplot(x='year_quarter', y='average_revenue_per_order', data=quarterly_aggr, color='darkslateblue', marker='D', ax=axes[1, 1], label='Revenue/Order')
sns.lineplot(x='year_quarter', y='average_profit_per_order', data=quarterly_aggr, color='darkolivegreen', marker='h', linestyle='--', ax=axes[1, 1], label='Profit/Order')
axes[1, 1].set_title('Average Revenue and Profit per Order by Quarter')
axes[1, 1].set_ylabel('Amount (USD)')
axes[1, 1].legend(loc='best')
#------------------------------------------------------------------------------------------------------------------------------
# 5. Number of Orders per Customer
sns.barplot(x='year_quarter', y='average_number_orders_per_customer', data=quarterly_aggr, color='limegreen', alpha=0.7, ax=axes[2, 0])
axes[2, 0].set_title('Number of Orders per Customer by Quarter')
axes[2, 0].set_ylabel('Orders/Customer')
#---------------------------------------------------------------------------------------------------------------------------------
# 6. Profit Margin and Average Discount
sns.lineplot(x='year_quarter', y='profit_margin', data=quarterly_aggr, color='purple', marker='o', ax=axes[2, 1], label='Profit Margin')
sns.lineplot(x='year_quarter', y='average_discount', data=quarterly_aggr, color='darkblue', marker='s', ax=axes[2, 1], label='Avg Discount')
axes[2, 1].set_title('Profit Margin vs. Average Discount by Quarter')
axes[2, 1].set_ylabel('Percentage (%)')
#--------------------------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.grid(True, linestyle=':', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)
    ax.set_xlabel('')
#-------------------------------------------------------------------------------------------------------------------------------
# Layout adjustments
plt.tight_layout(pad=3)
plt.show()
(Output: 3×2 grid of quarterly key-metric trend plots.)

Trends and Insights on Key Metrics Over the Quarters¶

1. Total Quantity Sold, Customer Base, and Order Volume:

The quarterly analysis reveals that the three metrics (Total Quantity Sold, Customer Base, and Order Volume) follow similar trends, characterized by:

  • A general upward trend, indicating consistent growth in customer acquisition and sales activity.
  • A strong seasonal pattern, with pronounced peaks in Q4 every year (e.g., 2021Q4, 2022Q4, etc.), followed by significant declines in Q1 of the subsequent year (e.g., 2022Q1, 2023Q1). This suggests heightened demand during holiday seasons, likely driven by promotional campaigns and increased consumer spending.

We can also observe that Order volumes show sharper increases in certain quarters compared to the number of customers, indicating that some customers place multiple orders during peak periods. This could be due to seasonal sales or repeat purchases during holidays. The seasonal and growth patterns underscore the importance of aligning inventory management and marketing efforts with peak demand periods to maximize revenue and customer satisfaction.

2. Total Sales, Profit, and Profit Margin:

  • Total Sales and Profit:
    These metrics closely mirror the pattern of quantity sold, with an upward trend and clear seasonal peaks in Q4 annually. The highest profit peak occurs in 2023Q4, suggesting an exceptionally profitable holiday season driven by strong demand.

  • Profit Margin:
    Profit margins fluctuate between 5.12% and 19.09%, with no clear long-term upward trend. The highest margin is observed in 2024Q1, despite lower sales, potentially due to higher-margin product sales or operational efficiencies. The lowest margin, in 2021Q1, likely reflects pricing or operational challenges during the business’s initial quarters.

  • Trade-off Between Sales and Margins: While Q4 consistently drives the highest sales and profits, it does not always result in the highest profit margins. This discrepancy could stem from higher discounts or increased operational costs during these periods. To balance revenue growth and profitability, the company should optimize pricing strategies and operational efficiency, particularly during Q4.

3. Average Revenue and Profit per Customer:

  • Revenue per Customer: Despite an increasing customer base and rising sales, the average revenue per customer remains relatively stable at around USD 550 across most quarters. Seasonal patterns are evident, with dips typically in Q2 (lower revenue per customer) and peaks in Q4 (higher revenue per customer) in most years. While Q4 peaks align with increases in both sales revenue and the customer base, the dips in revenue per customer do not consistently coincide with Q1, which has the lowest sales and customer numbers annually. This mismatch suggests that factors beyond overall sales or customer volume—such as customer purchasing behavior or product mix—play a role in determining revenue per customer, particularly in Q1.

  • Profit per Customer: Profit per customer shows no clear trend or seasonal pattern, fluctuating but remaining relatively low compared to revenue per customer. For example, in 2023Q4, the average revenue per customer was USD 684.34, while the profit per customer was USD 110.55—a difference of USD 573.79, reflecting high costs or discounts. There is potential to optimize cost structures or pricing strategies to enhance profitability per customer, especially during peak quarters.

4. Average Revenue and Profit per Order:

  • Similar to customer metrics, there is no clear upward trend in revenue or profit per order.
  • Revenue per order peaks in some Q4 periods but not consistently. However, dips are commonly observed in Q2 of each year.
  • Profit per order shows less variation than revenue, but it remains relatively low, reinforcing the need for cost optimization or price adjustments.

5. Number of Orders per Customer:

  • This metric closely aligns with the pattern of quantity sold, showing a slight upward trend and distinct seasonal peaks in Q4 each year, followed by declines in Q1.
  • The consistent growth in orders per customer indicates increasing engagement, likely due to targeted marketing or repeat purchases.
  • Capitalizing on repeat purchases through loyalty programs or personalized offers could enhance customer lifetime value.

6. Profit Margin and Discount:

  • Discounts: Discounts ranged from 14.46% to 17.17%, with notable peaks in Q4 2021, Q2 2022, and Q2 2024, and dips in Q4 2023 and Q3 2024.

  • Profit Margin: As noted above, profit margins fluctuated across quarters, with peaks in Q1 2024, Q4 2023, and Q2 2022.

  • Relationship Between Discount and Profit Margin: A clear inverse relationship is evident, where higher discounts coincide with lower profit margins. For instance, Q4 2023, which had the lowest average discount, also recorded one of the highest profit margins for the year. Strategic discount management is critical to maintaining profitability while driving sales volume during seasonal peaks.
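The inverse relationship described above can be quantified with a Pearson correlation coefficient. A sketch using the five quarters displayed in the aggregation table earlier; in the notebook itself, the full 16-quarter estimate would come from `quarterly_aggr['average_discount'].corr(quarterly_aggr['profit_margin'])`:

```python
import pandas as pd

# Average discount and profit margin for the five quarters shown earlier
sample = pd.DataFrame({
    'average_discount': [15.74, 14.55, 15.51, 16.76, 14.60],
    'profit_margin': [5.12, 12.95, 8.91, 12.09, 13.46],
})

# Pearson correlation between discount level and profit margin
r = sample['average_discount'].corr(sample['profit_margin'])
```

Even on this small sample the coefficient is negative, consistent with the discount/margin trade-off; the full 16-quarter frame gives the more reliable estimate.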

3.3.3. Monthly Analysis¶

Objective: Examine trends, fluctuations, and seasonal patterns at a monthly level to understand their impact on sales, profitability, and overall business performance over time.

# a)  Data Aggregation

# Aggregate data by year_month
monthly_aggr = df.groupby(['year_month'],as_index=False).agg({
    'quantity':'sum',
    'sales': 'sum',
    'profit': 'sum',
    'discount': 'mean',
    'order_id': 'nunique',
    'customer_id': 'nunique'
}).round(2)

# Rename the columns (to improve clarity)
monthly_aggr= monthly_aggr.rename(columns={'quantity':'total_quantity_sold','sales':'total_sales','profit':'total_profit',
                                           'discount':'average_discount',
                                           'order_id': 'number_of_orders','customer_id': 'number_of_customers'})

# Add new metrics and new columns to  the monthly_aggr dataframe

monthly_aggr['profit_margin'] = ((monthly_aggr['total_profit'] / monthly_aggr['total_sales'])* 100)
monthly_aggr['average_revenue_per_customer'] = (monthly_aggr['total_sales'] / monthly_aggr['number_of_customers'])
monthly_aggr['average_revenue_per_order'] = (monthly_aggr['total_sales'] / monthly_aggr['number_of_orders'])
monthly_aggr['average_profit_per_customer'] = (monthly_aggr['total_profit'] / monthly_aggr['number_of_customers'])
monthly_aggr['average_profit_per_order'] = (monthly_aggr['total_profit'] / monthly_aggr['number_of_orders'])
monthly_aggr['average_number_orders_per_customer'] = (monthly_aggr['number_of_orders'] / monthly_aggr['number_of_customers'])

# Convert 'year_month' column from period to string (to ensure compatibility with plotting)
monthly_aggr['year_month'] = monthly_aggr['year_month'].astype(str)

# Display the aggregation results
monthly_aggr.round(2).head()
year_month total_quantity_sold total_sales total_profit average_discount number_of_orders number_of_customers profit_margin average_revenue_per_customer average_revenue_per_order average_profit_per_customer average_profit_per_order average_number_orders_per_customer
0 2021-01 284 14236.90 2450.19 12.66 32 32 17.21 444.90 444.90 76.57 76.57 1.00
1 2021-02 159 4519.89 862.31 17.61 28 27 19.08 167.40 161.42 31.94 30.80 1.04
2 2021-03 585 55691.01 498.73 16.75 71 69 0.90 807.12 784.38 7.23 7.02 1.03
3 2021-04 536 28295.34 3488.84 11.00 66 64 12.33 442.11 428.72 54.51 52.86 1.03
4 2021-05 466 23648.29 2738.71 15.53 69 67 11.58 352.96 342.73 40.88 39.69 1.03
# b) Descriptive statistics of the aggregated data.
monthly_stats=monthly_aggr.describe()
monthly_stats.round(2)
total_quantity_sold total_sales total_profit average_discount number_of_orders number_of_customers profit_margin average_revenue_per_customer average_revenue_per_order average_profit_per_customer average_profit_per_order average_number_orders_per_customer
count 48.00 48.00 48.00 48.00 48.00 48.00 48.00 48.00 48.00 48.00 48.00 48.00
mean 789.02 47858.35 5966.61 15.63 104.35 96.23 12.47 489.16 457.98 59.47 55.58 1.07
std 401.90 25195.89 4288.70 2.11 53.78 45.61 7.41 112.74 104.58 42.24 39.94 0.05
min 159.00 4519.89 -3281.01 9.64 28.00 27.00 -18.05 167.40 161.42 -117.18 -113.14 1.00
25% 541.25 29790.10 2939.56 14.08 68.00 66.50 10.17 418.81 387.30 44.52 39.24 1.03
50% 664.00 39803.24 4990.55 15.86 89.50 84.50 12.61 474.08 440.37 58.77 55.96 1.07
75% 1019.75 65833.34 8742.50 17.19 134.75 120.00 15.50 553.27 522.73 76.97 75.69 1.09
max 1840.00 118447.82 17885.31 19.44 261.00 216.00 27.21 807.12 784.38 170.98 154.70 1.21
       

Some Insights:¶

Total Profit: The lowest monthly profit occurred in January 2022, a loss of USD 3,281, and the highest in December 2023 (USD 17,885), with a mean of USD 5,966, a median of USD 4,990, and a standard deviation of USD 4,288 (very high variation);

Profit Margin: The monthly profit margin ranges from -18.05% to 27.21%, with a mean of 12.47% and a standard deviation of 7.41%.

Number of Customers: The lowest monthly number of customers over the period is 27 and the highest is 216, with a mean of 96 and a median of roughly 85.
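The "very high variation" noted for monthly profit can be made precise with the coefficient of variation (standard deviation divided by mean). A minimal sketch using the figures reported in the descriptive statistics above:

```python
# Coefficient of variation (CV = std / mean) quantifies dispersion relative to
# the average, using the monthly total_profit figures from describe() above
mean_profit = 5966.61   # mean monthly total profit (USD)
std_profit = 4288.70    # standard deviation of monthly total profit (USD)

cv = std_profit / mean_profit
print(f'CV of monthly profit: {cv:.2f}')  # the std is roughly 72% of the mean
```

A CV above 0.5 is generally considered high dispersion, which supports the observation that monthly profit is highly volatile.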

# c) Visualizing Key Metric Trends over the Months

from matplotlib.ticker import MaxNLocator

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a figure with 3x2 subplots
fig, axes = plt.subplots(3, 2, figsize=(18, 15))

# Set figure background color to 'lightgrey'
fig.patch.set_facecolor('lightgrey')
#--------------------------------------------------------------------------------------------------------------------------
## 1. Total Quantity, Number of Customers, and Orders with dual y-axis

# Create  bar plots for  Total Quantity
sns.barplot(x='year_month', y='total_quantity_sold', data=monthly_aggr, color='steelblue', alpha=0.6, ax=axes[0, 0],label='Quantity Sold')

# create line plots for Number of Customers and Orders
ax1_secondary = axes[0, 0].twinx()
sns.lineplot(x='year_month', y='number_of_customers', data=monthly_aggr, color='orange', marker='o', ax=ax1_secondary,label='Customers')
sns.lineplot(x='year_month', y='number_of_orders', data=monthly_aggr, color='teal', marker='x', ax=ax1_secondary,label='Orders')

# Customizations
axes[0, 0].set_title('Total Quantity Sold, Customers and Orders by Month')
axes[0, 0].set_ylabel('Quantity Sold (Units)')
ax1_secondary.set_ylabel('Counts')
ax1_secondary.grid(False)
axes[0, 0].legend(loc='upper left')
ax1_secondary.legend(loc='upper left',bbox_to_anchor=(0, 0.92))
#------------------------------------------------------------------------------------------------------------------------------
## 2. Total Sales, Profit and Profit Margin

# Create side-by-side bar plots for Total Sales and Profit
melted_df = monthly_aggr.melt(id_vars='year_month', value_vars=['total_sales', 'total_profit'], var_name='Metric', value_name='Value')
melted_df['Metric'] = melted_df['Metric'].replace({'total_sales': 'Sales', 'total_profit': 'Profit'})
sns.barplot(x='year_month', y='Value', hue='Metric', data=melted_df, palette=['blue', 'orange'], ax=axes[0, 1], alpha=0.85)

# Create line plot for profit margin on the secondary axis
ax2_secondary = axes[0, 1].twinx()
sns.lineplot(x='year_month', y='profit_margin', data=monthly_aggr, color='red', marker='^', linestyle='--', ax=ax2_secondary,label='Profit Margin')

# Customizations:
axes[0, 1].set_title('Total Sales, Profit and Profit Margin by Month')
axes[0, 1].set_ylabel('Amount (USD)')
ax2_secondary.set_ylabel('Profit Margin (%)')
ax2_secondary.grid(False)
axes[0, 1].legend()
ax2_secondary.legend(loc='upper right')
#---------------------------------------------------------------------------------------------------------------------------------
## 3. Average Revenue and Profit per Customer
sns.lineplot(x='year_month', y='average_revenue_per_customer', data=monthly_aggr, color='blue', marker='o', ax=axes[1, 0], label='Revenue/Customer')
sns.lineplot(x='year_month', y='average_profit_per_customer', data=monthly_aggr, color='green', marker='s', linestyle='--', ax=axes[1, 0], label='Profit/Customer')
axes[1, 0].set_title('Average Revenue and Profit per Customer by Month')
axes[1, 0].set_ylabel('Amount (USD)')
axes[1, 0].legend(loc='upper right')
#-------------------------------------------------------------------------------------------------------------------------------
## 4. Average Revenue and Profit per Order
sns.lineplot(x='year_month', y='average_revenue_per_order', data=monthly_aggr, color='darkslateblue', marker='D', ax=axes[1, 1], label='Revenue/Order')
sns.lineplot(x='year_month', y='average_profit_per_order', data=monthly_aggr, color='darkolivegreen', marker='h', linestyle='--', ax=axes[1, 1], label='Profit/Order')
axes[1, 1].set_title('Average Revenue and Profit per Order by Month')
axes[1, 1].set_ylabel('Amount (USD)')
axes[1, 1].legend(loc='upper right')
#------------------------------------------------------------------------------------------------------------------------------
## 5. Average Number of Orders 
sns.barplot(x='year_month', y='average_number_orders_per_customer', data=monthly_aggr, color='limegreen', alpha=0.7, ax=axes[2, 0])
axes[2, 0].set_title('Number of Orders per Customer by Month')
axes[2, 0].set_ylabel('Orders/Customer')
#--------------------------------------------------------------------------------------------------------------------------------
## 6. Profit Margin and Average Discount
sns.lineplot(x='year_month', y='profit_margin', data=monthly_aggr, color='purple', marker='o', ax=axes[2, 1], label='Profit Margin')
sns.lineplot(x='year_month', y='average_discount', data=monthly_aggr, color='darkblue', marker='s', ax=axes[2, 1], label='Avg Discount')
axes[2, 1].set_title('Profit Margin vs. Average Discount by Month')
axes[2, 1].set_ylabel('Percentage (%)')
#---------------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flatten():
    ax.xaxis.set_major_locator(MaxNLocator(nbins=12))  # Adjusting number of ticks 
    ax.grid(True, linestyle=':', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)
    ax.set_xlabel('')

# Adjust layout to avoid overlap
plt.tight_layout(pad=3)
plt.show()
[Figure: Monthly trends of key metrics (3x2 subplot grid).]

Trends and Insights on Key Metrics Over the Months¶

1. Total Quantity Sold, Customer Base, and Order Volume:

As observed in the quarterly analysis, the three metrics evolve in a similar manner on a monthly basis, highlighting:

  • A clear upward trend, reflecting improving business performance across the years.
  • Month-to-month fluctuations and recurring seasonal activity patterns each year:
    • 1. Peak Activity Periods: The highest activity levels occur in November (highest peak), December, and September (to a lesser extent). Interestingly, there is a dip in October, breaking the momentum between September and November.
    • 2. Low Activity Periods: January and February are consistently the months with the lowest activity levels across all metrics.
    • 3. Stable Activity Periods: The period from March to August shows relatively stable activity with minor fluctuations, suggesting regular business operations without major seasonal influences.
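The seasonal pattern described above can be checked programmatically by grouping on the calendar month (pooled across years) instead of year_month. A minimal sketch on toy data; the column names mirror the notebook's df, but the values are illustrative:

```python
import pandas as pd

# Toy transactional data (illustrative, not the Superstore dataset)
orders = pd.DataFrame({
    'order_date': pd.to_datetime(
        ['2021-11-05', '2021-11-20', '2021-12-03', '2022-11-11',
         '2022-02-14', '2023-02-01', '2023-11-28', '2023-12-15']),
    'sales': [500, 300, 450, 600, 80, 90, 700, 520],
})

# Average sales per calendar month, pooled across years: this isolates
# recurring seasonality from the overall growth trend
orders['month'] = orders['order_date'].dt.month
seasonal = orders.groupby('month')['sales'].mean().round(2)
print(seasonal)
```

On the real data, plotting this series would make the November peak and the January/February trough directly visible.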

2. Total Sales, Profit and Profit Margin:

  • Sales and profit: Sales and profit follow an almost identical trajectory to total quantity sold, with a clear upward trend and notable spikes in September, November, and December of each year. These increases reflect the impact of seasonal demand and holiday shopping trends, which typically drive higher consumer spending during these months.

  • Two months show negative profits (losses): July 2021 and January 2022, highlighting operational challenges during these periods. These losses appear to stem from low sales volumes coupled with relatively high or fixed operational costs, creating a scenario where revenue was insufficient to cover expenses. Consequently, all profit-related ratios, such as profit margin, profit per customer, and profit per order, also exhibit negative results, further emphasizing the financial strain during these months.

  • Profit Margin:

    • The profit margin demonstrates significant volatility across months, marked by both sharp peaks and deep troughs. The highest profit margins are observed in October 2023 (27.21%), March 2022 (25.13%), March 2024 (25.06%), and February 2022 (23.54%). These months likely benefited from favorable pricing strategies or a product mix with higher margins.
    • The months with the lowest profit margins are January 2022 (-18.05%), July 2021 (-2.48%), March 2021 (0.90%), April 2024 (2.54%), and November 2023 (5.05%). These periods highlight challenges such as low sales volumes, higher operational costs, or an unfavorable product mix, which significantly eroded profitability during these months.
    • Similar to quarterly trends, high-sales months (September, November, and December) do not always correspond with high-profit margins. This may be due to factors such as increased discounts, promotional activities, or higher operational costs during peak demand periods, which can erode profit margins despite higher sales volumes.

3. Average Revenue and Profit per Customer:

  • Revenue per Customer: The monthly analysis reveals no clear upward trend or consistent seasonal pattern in revenue per customer. Instead, revenue per customer shows notable fluctuations across months, indicating variability in customer spending behavior. The months with the highest revenue per customer are March 2021 (USD 807.12), September 2021 (USD 693.03), and January 2023 (USD 656.29).
  • Average Profit per Customer: Profit per customer exhibits less variability than revenue per customer and remains significantly lower in magnitude. The highest profit per customer is observed in October 2023 (USD 170.98) and March 2024 (USD 128.28). The months with the lowest profit per customer are July 2021 (USD -117.18), January 2022 (USD -12.95), March 2021 (USD 7.23), and April 2024 (USD 8.56).
    We can observe that peaks in revenue and profit per customer do not align with the months experiencing the highest customer volumes (e.g., September, November, and December). This suggests that higher revenues are often driven by a smaller group of high-spending customers rather than by an increase in the number of customers. Such patterns emphasize the importance of targeting high-value customers to maximize profitability.
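The "small group of high-spending customers" hypothesis can be tested by measuring what share of revenue the top customers account for. A minimal sketch on toy data; in the notebook, the same logic would run on df grouped by customer_id:

```python
import pandas as pd

# Toy per-transaction data (illustrative)
df_toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'B', 'C', 'D', 'E'],
    'sales': [900.0, 600.0, 200.0, 150.0, 100.0, 50.0],
})

# Revenue per customer, largest first
rev = df_toy.groupby('customer_id')['sales'].sum().sort_values(ascending=False)

# Share of total revenue captured by the top 20% of customers
top_n = max(1, int(len(rev) * 0.2))
top_share = rev.head(top_n).sum() / rev.sum() * 100
print(f'Top {top_n} customer(s) generate {top_share:.1f}% of revenue')
```

A strongly skewed share (e.g., 20% of customers driving most of the revenue) would confirm the Pareto-style concentration suggested by the monthly figures.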

4. Average Revenue and Profit per Order:

  • These metrics follow a similar pattern to revenue and profit per customer, with notable fluctuations across months but no clear upward trend or consistent seasonal pattern. Like Profit per Customer, Profit per order shows less variation and remains relatively low (mostly below USD 80) compared to revenue per order. This analysis reinforces the need for cost structure improvements and optimized pricing to boost per-order profitability.

5. Number of Orders per Customer:
The number of orders per customer fluctuates within the range of 1 to 1.21, showing a slight upward trend over time. Seasonal patterns are evident, with peaks in September, October, November, and December each year, followed by a decrease at the start of the new year. To capitalize on these trends, businesses could focus on encouraging repeat purchases through loyalty programs or seasonal promotions, particularly during peak months.

6. Discount and Profit Margin:

  • Discount: While discount levels fluctuate month-to-month, they remain relatively stable over the period, with no clear upward or downward trend, indicating a consistent discounting strategy. Months with the highest discount levels include April 2024 (19.44%), March 2023 (19.09%), and December 2021 (19.08%). These peaks may reflect special campaigns, holiday promotions, or inventory clearance efforts. Months with the lowest discount levels include February 2023 (9.64%), April 2021 (11.00%), and January 2021 (12.66%). These months indicate a more conservative approach to discounting, potentially driven by steady demand or a reduced need for promotional incentives.
  • Relationship between the metrics: There appears to be an inverse relationship between average discount and profit margin. Higher discounts often coincide with lower profit margins, as larger discounts reduce per-unit profitability. Conversely, lower discounts are often associated with higher profit margins, suggesting that profitability is sustained during periods of reduced promotional activity.

This trend underscores the trade-off between discounting and profitability—while discounts may drive sales volume, they can erode profit margins if not balanced effectively.
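The inverse relationship can be quantified with a correlation coefficient; in the notebook this would be monthly_aggr['average_discount'].corr(monthly_aggr['profit_margin']). A minimal sketch on toy monthly figures:

```python
import pandas as pd

# Toy monthly figures (illustrative): higher discounts paired with lower margins
toy = pd.DataFrame({
    'average_discount': [10.0, 12.0, 15.0, 17.0, 19.0],
    'profit_margin':    [22.0, 18.0, 12.0,  6.0, -2.0],
})

# Pearson correlation; a value close to -1 confirms the trade-off
corr = toy['average_discount'].corr(toy['profit_margin'])
print(round(corr, 3))
```

A strongly negative coefficient on the real data would support treating discount levels as a lever for margin management, while a weak one would suggest other cost drivers dominate.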


3.3.4 Performance Analysis by Segment¶

Objective: Evaluate product, customer, and regional performance to understand key drivers of sales and profitability. This analysis aims to:

  • Highlight the most profitable and high-performing segments.
  • Understand areas that need attention or improvement.
  • Provide actionable insights for strategic decision-making.

3.3.4.1. Product Performance Analysis¶

Objective: Analyze the performance of products to identify top-performing categories, subcategories, and individual products.

This analysis will be conducted in two steps for each level: data aggregation and data visualization to systematically derive insights from the data.

3.3.4.1.1. Category Analysis¶
  • 1) Category Analysis over the Entire Period¶
# a) Aggregate metrics at the category level for the entire period

df_cat = df.groupby('category').agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean') 
).reset_index()

# Calculate percentage contribution to total sales
df_cat['sales_share'] = (df_cat['total_sales'] / df_cat['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_cat['profit_share'] = (df_cat['total_profit'] / df_cat['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_cat['profit_margin'] = (df_cat['total_profit'] / df_cat['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_cat = df_cat.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_cat.round(2)
category total_quantity_sold total_sales total_profit average_discount sales_share profit_share profit_margin
0 Technology 6939 836154.03 145454.95 13.23 36.4 50.79 17.40
1 Furniture 8028 741999.80 18451.27 17.39 32.3 6.44 2.49
2 Office Supplies 22906 719047.03 122490.80 15.73 31.3 42.77 17.04
# b) Visualizing Key Metrics by Category over the Entire Period

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a 1x2 grid of subplots 
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Set background color for the entire figure
fig.patch.set_facecolor('lightgrey')
#-------------------------------------------------------------------------------------------------------------------------------
### 1. Total Sales, Total Profit & Total Quantity by Category

# Create side-by-side bar plots for Total sales and Profit
melted_cat = df_cat.melt(id_vars='category', value_vars=['total_sales', 'total_profit'], var_name='Metric', value_name='Value')
melted_cat['Metric'] = melted_cat['Metric'].replace({'total_sales': 'Sales','total_profit': 'Profit'})
sns.barplot( x='category',  y='Value', hue='Metric', data=melted_cat, palette=['blue', 'orange'], ax=axes[0], alpha=0.85)

# Create a line plot for Total Quantity on the secondary axis 
ax2_secondary = axes[0].twinx()  
sns.lineplot(x='category', y='total_quantity_sold', data=df_cat, color='red', marker='^', linestyle='--', ax=ax2_secondary, label='Quantity Sold')

# Customizations
axes[0].set_title('Total Quantity Sold, Sales and Profit by Category')
axes[0].set_ylabel('Amount (USD)')
ax2_secondary.set_ylabel('Quantity Sold (Units)')  
ax2_secondary.legend(loc='upper center')
ax2_secondary.grid(False)  
#------------------------------------------------------------------------------------------------------------------------------

### 2. Profit Margin vs. Average Discount by Category
sns.barplot(data=df_cat, x='category', y='profit_margin', palette='coolwarm',hue='category',ax=axes[1], width=0.8,legend=False)
sns.lineplot(data=df_cat, x='category', y='average_discount',ax=axes[1], color='green', marker='o',linewidth=2, linestyle='--', label='Average Discount')
axes[1].set_title('Profit Margin vs. Average Discount by Category')
axes[1].set_ylabel('Percentage (%)')
#------------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flatten():
    ax.grid(True, linestyle=':', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)
    ax.set_xlabel('')
    ax.legend(loc='best')
#------------------------------------------------------------------------------------------------------------------------------ 
# Add Data Labels for Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax==axes[1] else '%.0f'
    for container in ax.containers:  
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='edge') 
#-------------------------------------------------------------------------------------------------------------------------------

# Adjust layout to avoid overlap
plt.tight_layout(pad=3)
plt.show()
[Figure: Key metrics by category over the entire period (two-panel bar charts).]

Some Insights:¶

Technology:

  • The Technology category consists of four subcategories: Phones, Accessories, Machines, and Copiers.

  • This category has the highest total sales (USD 836,154.03), total profit (USD 145,454.95) and profit margin (17.40%), despite having the lowest total quantity sold. This suggests that even though fewer units were sold, the higher value of each unit drove substantial revenue and profit, likely due to premium pricing or high-margin products.

  • It contributes the highest share of total profit (50.79%) while accounting for only 36.4% of total sales, making it the strongest profit driver.

  • The discount is the lowest (13.23%) in this category, probably contributing to its strong profitability. Note: While all the subcategories contribute to sales, Phones are the primary driver of revenue.

Furniture:

  • The Furniture category consists of four subcategories: Chairs, Furnishings, Bookcases and Tables.

  • Despite achieving the second-highest total quantity sold (8,028 units) and total sales (USD 741,999.80), this category has a very low profit (USD 18,451.27) and the lowest profit margin (2.49%). This indicates that while the volume of sales is strong, the profitability per unit is significantly low, possibly due to high costs or low-margin products.

  • While it accounts for 32.3% of total sales, it contributes only 6.44% of total profit, highlighting weak profitability relative to sales. This shows clearly that, although the category generates substantial revenue, it does not translate effectively into profit.

  • It has the highest discount level (17.39%), which appears to be a major factor eroding profit margins. Note: Chairs represent the main contributor to sales revenue in this category.

Office Supplies:

  • The Office Supplies category comprises nine subcategories: Paper, Binders, Labels, Storage, Art, Fasteners, Envelopes, Appliances and Supplies.
  • This category has the highest total quantity sold (22,906 units) but the lowest total sales (USD 719,047.03). This indicates that while the volume of units sold is significant, the average selling price per unit is relatively low compared to other categories, possibly due to a high proportion of lower-priced products.
  • It contributes 31.3% of sales but 42.77% of total profit, with a profit margin of 17.04%, close to that of Technology. This suggests that this category is highly profitable relative to its sales volume, likely due to high-margin products or effective cost control.
  • Discounts (15.73%) are moderate but appear well-managed, maintaining strong profitability. Note: The main driver of sales in this category is Binders.

Discount-Profitability Relationship:

The data suggests a negative correlation between discount levels and profitability:

  • Furniture: Highest Discount (17.39%) → Lowest Profit Margin (2.49%)
  • Technology: Lowest Discount (13.23%) → Highest Profit Margin (17.40%)
  • Office Supplies: Moderate Discount (15.73%) → Strong Profit Margin (17.04%)

This reinforces the importance of discount optimization in maintaining profitability.
  • 2) Category Analysis over the Years¶
# a) Aggregate metrics at the category level for each year

df_cat_year = df.groupby(['year', 'category']).agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_cat_year['sales_share'] = (df_cat_year['total_sales'] / df_cat_year['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_cat_year['profit_share'] = (df_cat_year['total_profit'] / df_cat_year['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_cat_year['profit_margin'] = (df_cat_year['total_profit'] / df_cat_year['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_cat_year = df_cat_year.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_cat_year.round(2).head()
year category total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 2024 Technology 2363 271730.81 50684.26 13.03 387 11.83 17.70 18.65
1 2024 Office Supplies 7676 246097.18 39736.62 16.00 634 10.71 13.87 16.15
2 2023 Technology 1698 226364.18 39773.99 13.66 290 9.85 13.89 17.57
3 2024 Furniture 2437 215387.27 3018.39 16.99 403 9.38 1.05 1.40
4 2023 Furniture 2193 198901.44 6959.95 17.69 355 8.66 2.43 3.50
# b) Visualizing Key Metrics by Category over the Years

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a figure with a 3x2 grid of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(22, 20))

# Set the background color for the figure area
fig.patch.set_facecolor('lightgrey')
# ------------------------------------------------------------------------------------------------------------------------------
# Preparing data for stacked plots: reshaping into wide format with a single pivot() call
df_cat_year_pivot = df_cat_year.pivot(index='year',columns='category',values=['total_quantity_sold','total_sales','total_profit', 'number_of_customers'])

# N.B: Here, we use a single pivot() call over several value columns instead of creating separate pivot tables for
# total quantity, total sales, ..., which makes the code cleaner, more flexible and efficient.
# To access a specific metric: df_cat_year_pivot['total_sales'], for example

# ------------------------------------------------------------------------------------------------------------------------------
# 1.Total Quantity by Category over the Years
df_cat_year_pivot['total_quantity_sold'].plot(kind='bar', stacked=True, ax=axes[0, 0], cmap='tab10', width=0.8)
axes[0, 0].set_title('Total Quantity Sold by Category Over the Years' )
axes[0, 0].set_ylabel('Quantity Sold (Units)')
#----------------------------------------------------------------------------------------------------------------------------
# 2. Total Sales by Category over the Years
df_cat_year_pivot['total_sales'].plot(kind='bar', stacked=True, ax=axes[0, 1], cmap='tab10', width=0.8)
axes[0, 1].set_title('Total Sales by Category Over the Years')
axes[0, 1].set_ylabel('Total Sales (USD)')
# ------------------------------------------------------------------------------------------------------------------------------
# 3. Total Profit by Category over the Years
df_cat_year_pivot['total_profit'].plot(kind='bar', stacked=True, ax=axes[1, 0], cmap='tab10', width=0.8)
axes[1, 0].set_title('Total Profit by Category Over the Years')
axes[1, 0].set_ylabel('Total Profit (USD)')
# ------------------------------------------------------------------------------------------------------------------------------
# 4. Number of Customers by Category over the Years
df_cat_year_pivot['number_of_customers'].plot(kind='bar', stacked=True, ax=axes[1, 1], cmap='tab10', width=0.8)
axes[1, 1].set_title('Number of Customers by Category Over the Years')
axes[1, 1].set_ylabel('Number of Customers')
# ------------------------------------------------------------------------------------------------------------------------------
# 5. Profit Margin by Category over the Years
sns.barplot(x='year', y='profit_margin', data=df_cat_year, hue='category', palette='tab10', ax=axes[2, 0], errorbar=None)
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, ax=axes[2, 0], color='green', label='Overall Profit Margin')  
axes[2, 0].set_title('Profit Margin by Category Over the Years')
axes[2, 0].set_ylabel('Profit Margin (%)')
# ------------------------------------------------------------------------------------------------------------------------------
# 6. Average Discount by Category over the Years
sns.barplot(x='year', y='average_discount', data=df_cat_year, hue='category', palette='tab10', ax=axes[2, 1], errorbar=None)
sns.lineplot(x='year', y='average_discount', data=yearly_aggr, ax=axes[2, 1], color='orange', label='Overall Discount')
axes[2, 1].set_title('Discount by Category Over the Years')
axes[2, 1].set_ylabel('Discount (%)')
# ------------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.grid(True, linestyle='--', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)
    ax.tick_params(axis='y')
    ax.set_xlabel('')
    ax.legend(loc='best') 
#------------------------------------------------------------------------------------------------------------------------------
# Add Data Labels for Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax in [axes[2, 0], axes[2, 1]] else '%.0f'
    for container in ax.containers:  # Loop through bar containers
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='center')     
#-----------------------------------------------------------------------------------------------------------------------------
# Adjust layout to avoid overlap
plt.tight_layout(pad=3)
plt.show()
#-----------------------------------------Note--------------------------------------------------------------------
                                    # Creating stacked bar plots
# Stacked bar plots are a powerful tool in data analytics for visualizing the contributions of different categories over time or across groups.
# While Seaborn is excellent for many types of visualizations, it does not natively support stacked bar plots or pie charts.
# For these types of plots, pandas and Matplotlib are commonly used.

# To create a stacked bar plot in pandas, the data often needs to be in a wide format, where each column represents a category,
# and each row represents a group (e.g., time periods or regions).
# If the data is in a long format, we can reshape it using pivot() or pivot_table().
# N.B: pivot_table() can handle aggregation (e.g., summing sales by category and year), while pivot() does not support aggregation.
# -------------------------------------------------------------------------------------------------------------------
[Figure: Key metrics by category over the years (3x2 grid of bar charts).]
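The pivot() vs. pivot_table() distinction noted above is easiest to see on a tiny example: with duplicate (index, column) pairs, pivot() raises an error, while pivot_table() aggregates them. A minimal sketch with illustrative data:

```python
import pandas as pd

# Long-format toy data with a duplicate (2023, 'Tech') pair
long_df = pd.DataFrame({
    'year':     [2023, 2023, 2023, 2024],
    'category': ['Tech', 'Tech', 'Furn', 'Tech'],
    'sales':    [100, 50, 80, 120],
})

# pivot() cannot handle the duplicate (2023, 'Tech') entry
try:
    long_df.pivot(index='year', columns='category', values='sales')
except ValueError as e:
    print('pivot() failed:', e)

# pivot_table() aggregates the duplicates (here: sum) into wide format
wide = long_df.pivot_table(index='year', columns='category',
                           values='sales', aggfunc='sum')
print(wide)
```

In this notebook, df_cat_year is already aggregated to one row per (year, category), so plain pivot() suffices; pivot_table() would be needed if reshaping directly from the transaction-level df.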

Some Insights:¶

1. Total Quantity by Product Category Over the Years:

  • Total quantity sold has steadily increased over time across all product categories, reaching its highest level in 2024.
  • Office Supplies consistently dominates in total quantity sold, holding the largest share each year.
  • Furniture follows as the second-largest category in terms of units sold, while Technology remains the lowest.
  • The growth in total quantity suggests increasing demand, but the disparity in sales volume across categories highlights differences in purchasing patterns.

2. Total Sales by Product Category over the Years:

  • Sales consistently increased over time for all product categories, except for the year 2022, where a slight decline is observed in the Technology and Office Supplies categories compared to 2021. This decline contributed to an overall drop in global yearly sales for 2022.
  • The Technology category, despite selling fewer units than the other categories, consistently leads in sales, except in 2022, when its sales were slightly below Furniture's level. This suggests that higher unit prices and premium product offerings drive its revenue performance.
  • The Office Supplies category, despite its dominance in quantity sold, records the lowest sales figures over time, except in 2024, when it surpasses Furniture. This suggests that the average selling price per unit is significantly lower than in other categories. Thus, Technology and Office Supplies play a critical role in driving overall revenue.

3. Total Profit by Product Category over the Years:

  • The Technology and Office Supplies categories exhibit consistent profit growth over time, reflecting their robust market performance.
  • Technology is consistently the most profitable category, except in 2021, where its profit was slightly below the level of Office Supplies.
  • Despite steady sales, Furniture shows variable and relatively low profit levels, indicating potential operational inefficiencies or high costs affecting its profitability. In that case, we can try to build on the strong profitability of Technology and Office Supplies and focus on improving cost structures and operational processes for Furniture.

4. Customer Base by Product Category over the Years :

  • The number of customers shows consistent growth over time, despite a very slight decline in 2022 for Furniture and Office Supplies.
  • Office Supplies attracts the largest customer base, followed by Furniture and Technology.
  • Technology has the lowest customer count but contributes the most to sales and profits, indicating a higher-value customer segment with higher spending per customer.

5. Average Profit Margin by Product Category over the Years:

  • Technology and Office Supplies are the strongest performers in terms of profit margin, with Technology reaching its highest margins in 2022 and 2024, while Office Supplies peaked in 2021 and 2023. This indicates fluctuations in pricing strategies or cost structures across years.
  • Furniture consistently shows the lowest profit margin, staying below 5%. This reinforces the profitability concerns in this segment, as mentioned earlier.

6. Average Discount by Product Category over the Years:

  • The Furniture category consistently receives the highest average discounts (above 16%), which may partially explain its lower profit margins. The Office Supplies category follows, receiving moderate discounts, while Technology consistently benefits from the lowest discount rates, contributing to its strong profitability.
  • Globally, discount rates have remained relatively stable over the years, with no noticeable upward or downward trends. Reducing discount rates for Furniture or offering targeted discounts (e.g., on less profitable items) could enhance margins without hurting sales significantly.
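The targeted-discount idea above can be operationalized as a simple filter: flag subcategories that combine above-average discounts with negative profit. A minimal sketch, using approximate Furniture subcategory figures (discounts as fractions, profits in USD) for illustration:

```python
import pandas as pd

# Approximate Furniture subcategory aggregates (illustrative)
furn = pd.DataFrame({
    'sub_category': ['Tables', 'Chairs', 'Bookcases', 'Furnishings'],
    'discount':     [0.26, 0.17, 0.21, 0.14],
    'profit':       [-17725.48, 26590.17, -3472.56, 13059.14],
})

# Candidates for discount reduction: heavily discounted AND loss-making
candidates = furn[(furn['discount'] > 0.16) & (furn['profit'] < 0)]
print(candidates['sub_category'].tolist())   # ['Tables', 'Bookcases']
```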
3.3.4.1.2. Analysis by Subcategory of Products¶
  • 1) Subcategory Analysis over the Entire Period¶
# a) Aggregate metrics at the sub-category level for the entire period

df_subcat = df.groupby(['category', 'sub_category']).agg(
    total_quantity_sold=('quantity', 'sum'),
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'), 
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_subcat['sales_share'] = (df_subcat['total_sales'] / df_subcat['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_subcat['profit_share'] = (df_subcat['total_profit'] / df_subcat['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_subcat['profit_margin'] = (df_subcat['total_profit'] / df_subcat['total_sales']) * 100

# Display the DataFrame
df_subcat.round(2).head()
category sub_category total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 Furniture Bookcases 868 114880.00 -3472.56 21.11 195 5.00 -1.21 -3.02
1 Furniture Chairs 2356 328449.10 26590.17 17.02 407 14.30 9.28 8.10
2 Furniture Furnishings 3563 91705.16 13059.14 13.83 528 3.99 4.56 14.24
3 Furniture Tables 1241 206965.53 -17725.48 26.13 261 9.01 -6.19 -8.56
4 Office Supplies Appliances 1729 107532.16 18138.01 16.65 356 4.68 6.33 16.87
# b) Visualization: Top Performing Subcategories over the Entire Period

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(22, 20))

# Set background color for the entire figure
fig.patch.set_facecolor('lightgrey')

# -----------------------------------------------------------------------------------------------------------------------------
### 1. Top 10 Performing Subcategories by Quantity Sold

# Sort by total quantity and select top 10
top_10_subcat_quantity = df_subcat.nlargest(10, 'total_quantity_sold')

# Barplot for Quantity Sold
sns.barplot(x='sub_category', y='total_quantity_sold', hue='sub_category', data=top_10_subcat_quantity, palette='Dark2', ax=axes[0,0], alpha=0.85, legend=False)
axes[0, 0].set_title('Top 10 Performing Subcategories by Quantity Sold')
axes[0, 0].set_ylabel('Quantity Sold (Units)')

# -------------------------------------------------------------------------------------------------------------------------------
### 2. Top 10 Performing Subcategories by Sales

# Sort by total sales and select top 10
top_10_subcat_sales = df_subcat.nlargest(10, 'total_sales')

# Barplot for Sales
sns.barplot(x='sub_category', y='total_sales', hue='sub_category', data=top_10_subcat_sales, palette='Dark2', ax=axes[0,1], alpha=0.85, legend=False)
axes[0, 1].set_title('Top 10 Performing Subcategories by Sales')
axes[0, 1].set_ylabel('Total Sales (USD)')

# ------------------------------------------------------------------------------------------------------------------------------
### 3. Top 10 Performing Subcategories by Profit

# Sort by total profit and select top 10
top_10_subcat_profit = df_subcat.nlargest(10, 'total_profit')

# Barplot for Profit
sns.barplot(x='sub_category', y='total_profit', hue='sub_category', data=top_10_subcat_profit, palette='Dark2', ax=axes[1, 0], alpha=0.85, legend=False)
axes[1, 0].set_title('Top 10 Performing Subcategories by Profit')
axes[1, 0].set_ylabel('Total Profit (USD)')

# -------------------------------------------------------------------------------------------------------------------------------
### 4. Top 10 Performing Subcategories by Profit Margin

# Sort by profit margin and select top 10
top_10_subcat_margin = df_subcat.nlargest(10, 'profit_margin')

# Barplot for Profit Margin
sns.barplot(x='sub_category', y='profit_margin', hue='sub_category', data=top_10_subcat_margin, palette='Dark2', ax=axes[1, 1], alpha=0.85, legend=False)

# Overlay lineplot for Average Discount
sns.lineplot(x='sub_category', y='average_discount', data=top_10_subcat_margin, color='red', marker='o', ax=axes[1, 1], alpha=0.85, linewidth=2)
axes[1, 1].set_title('Top 10 Performing Subcategories by Profit Margin')
axes[1, 1].set_ylabel('Profit Margin (%)')
#-----------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.grid(True, linestyle=':', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)
    ax.set_xlabel('')
# ---------------------------------------------------------------------------------------------------------------------------- 
# Add Data Labels for Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax==axes[1,1] else '%.0f'
    for container in ax.containers: 
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='edge')              
#------------------------------------------------------------------------------------------------------------------------------
# Adjust layout to avoid overlap
plt.tight_layout(pad=3)

# Show the plot
plt.show()
[Figure: Top 10 performing subcategories by quantity sold, total sales, total profit, and profit margin]

Some Insights:¶

1. Top 10 Performing Subcategories by Quantity Sold:

  • Binders (5,974 Units) and Paper (5,178) are the most sold subcategories, showing strong demand.
  • Furnishings, Phones, Storage, Art, Accessories, and Chairs follow, with quantities between 2,000 and 3,600 units.
  • Appliances (1,729) and Labels (1,400) have the lowest quantity sold.

2. Top 10 Performing Subcategories by Total Sales:

  • Phones (USD 330,007) and Chairs (USD 328,449) lead in total sales, contributing over 28% of total revenue, despite ranking 4th and 8th in quantity sold, respectively.
  • Storage, Tables, and Binders form the next tier, with sales ranging between USD 200,000 and USD 225,000.
  • The remaining subcategories—Machines, Accessories, Copiers, Bookcases, and Appliances—generate less than USD 200,000 each.
  • Binders, despite being the best-selling subcategory, ranks only 5th in revenue, indicating a low price per unit.
  • Paper, Furnishings, and Art—despite high quantity sold—are absent from the top sales categories, suggesting low unit prices.
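The low-unit-price explanation can be checked by deriving an average selling price per unit (total sales divided by total quantity). A minimal sketch with approximate figures from this analysis (Binders' sales and Phones' quantity are rounded placeholders within the ranges quoted above):

```python
import pandas as pd

# Approximate subcategory aggregates (rounded placeholders, not exact values)
subcat = pd.DataFrame({
    'sub_category':   ['Binders', 'Phones'],
    'total_sales':    [210000.0, 330007.0],
    'total_quantity': [5974, 3000],
})

# Average selling price per unit explains rank gaps between quantity and revenue
subcat['avg_unit_price'] = subcat['total_sales'] / subcat['total_quantity']
print(subcat.round(2))
```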

3. Top 10 Performing Subcategories by Total Profit:

  • Copiers generate the highest profit (USD 55,618), despite relatively low sales (USD 149,528), indicating a high-profit margin.
  • Phones lead in sales and rank second in profit (USD 44,516), reinforcing their strong financial performance.
  • Accessories (USD 41,937), despite ranking 7th in both quantity and sales, outperform many higher-volume subcategories in profit.
  • Paper, Binders, Chairs, and Storage contribute moderately to profit, despite their high sales volumes or revenues.
  • Envelopes rank among the top 10 most profitable subcategories (USD 6,964), despite not being a top-selling subcategory. This suggests they benefit from a strong profit margin.
  • Tables and Bookcases report negative profits (USD -17,725 and USD -3,472, respectively) despite their moderate sales, meaning these sub-categories operate at a loss. A key observation is that some high-revenue subcategories (e.g., Chairs) do not translate into high profits, possibly due to cost factors.

4. Top 10 Performing Subcategories by Profit Margin:

  • Labels (44.42%), Paper (43.39%), and Envelopes (42.27%) have the highest profit margins, suggesting they are highly profitable relative to their sales, making them efficient from a profitability standpoint.
  • Copiers, despite generating the highest total profit, rank 4th in profit margin (37.2%). This suggests that their high total profit is driven by sales volume rather than margin efficiency, indicating relatively high costs or discounting.
  • Several high-volume subcategories (Binders, Furnishings, and Chairs) generate good sales but have margins below 15%.

Relationship between Profit margin and discount:

While a clear correlation between profit margin and discounting is not explicitly shown (based on subcategories), some key patterns emerge:

  • High-profit-margin subcategories (Labels, Paper, Envelopes) typically receive lower discounts.
  • Binders, a low-margin subcategory, receives the highest discounts, which likely contributes to its high sales volume but lower profitability.
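The suspected discount-margin relationship could also be tested numerically with a correlation across subcategories. A minimal sketch, assuming `df_subcat`-style columns (the toy values below roughly echo the high- and low-margin subcategories discussed; the real coefficient should be computed from the actual aggregates):

```python
import pandas as pd

# Toy subcategory aggregates (illustrative, not the actual Superstore values)
subcat = pd.DataFrame({
    'sub_category':     ['Labels', 'Paper', 'Envelopes', 'Binders', 'Tables'],
    'average_discount': [5.0, 6.0, 7.0, 37.0, 26.0],
    'profit_margin':    [44.4, 43.4, 42.3, 6.0, -8.6],
})

# Pearson correlation between average discount and profit margin
r = subcat['average_discount'].corr(subcat['profit_margin'])
print(f"discount vs. margin correlation: r = {r:.2f}")
```

A strongly negative coefficient on the real aggregates would support the pattern described above.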

  • 2) Subcategory Analysis over the Years¶
# a) Aggregate metrics at the sub-category level for each year

df_subcat_year = df.groupby(['year', 'sub_category']).agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_subcat_year['sales_share'] = (df_subcat_year['total_sales'] / df_subcat_year['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_subcat_year['profit_share'] = (df_subcat_year['total_profit'] / df_subcat_year['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_subcat_year['profit_margin'] = (df_subcat_year['total_profit'] / df_subcat_year['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_subcat_year = df_subcat_year.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_subcat_year.round(2).head()
year sub_category total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 2024 Phones 1090 105340.52 12849.32 16.39 244 4.59 4.49 12.20
1 2024 Chairs 674 95554.35 7643.55 16.74 149 4.16 2.67 8.00
2 2023 Chairs 614 83918.64 5763.15 17.88 137 3.65 2.01 6.87
3 2023 Phones 818 78962.03 9459.59 15.56 178 3.44 3.30 11.98
4 2021 Phones 684 77390.81 11808.01 15.32 136 3.37 4.12 15.26
# b) Visualization: Top Performing Subcategories over the Years

# Set Seaborn theme 
sns.set_theme(style='whitegrid',rc={ "axes.titlesize": 16, "axes.titleweight": "bold", "axes.labelsize": 14, "axes.labelweight": "bold",
                                    "legend.fontsize": 11, "xtick.labelsize": 12, "ytick.labelsize": 12})

#  Create a 2x2 grid of subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(22, 18))

# Set background color 
fig.patch.set_facecolor('lightgrey')

# -----------------------------------------------------------------------------------------------------------------------------
# 1. Top 5 Subcategories by Total Quantity Sold over the Years

# Identify the top 5 subcategories with the highest total quantity
top_5_subcat_quantity = df_subcat.nlargest(5, 'total_quantity_sold')

# Filter yearly data for the selected top 5 subcategories
top_5_subcat_quantity_year = df_subcat_year[df_subcat_year['sub_category'].isin(top_5_subcat_quantity['sub_category'])]

# Pivot data to format it for stacked bar plotting
subcat_quantity_year_pivot = top_5_subcat_quantity_year.pivot(index='year', columns='sub_category', values='total_quantity_sold')

# Plot a stacked bar chart to show yearly total quantity trends
subcat_quantity_year_pivot.plot(kind='bar', stacked=True, ax=axes[0, 0], width=0.8, cmap='tab10')
axes[0, 0].set_title('Top 5 Subcategories by Quantity over the Years')
axes[0, 0].set_ylabel('Quantity Sold (Units)')

# -----------------------------------------------------------------------------------------------------------------------------
# 2. Top 5 Subcategories by Total Sales over the Years

# Identify the top 5 subcategories with the highest total sales
top_5_subcat_sales = df_subcat.nlargest(5, 'total_sales')

# Filter yearly data for the selected top 5 subcategories
top_5_subcat_sales_year = df_subcat_year[df_subcat_year['sub_category'].isin(top_5_subcat_sales['sub_category'])]

# Pivot data to format it for stacked bar plotting
subcat_sales_year_pivot = top_5_subcat_sales_year.pivot(index='year', columns='sub_category', values='total_sales')

# Plot a stacked bar chart to show yearly total sales trends
subcat_sales_year_pivot.plot(kind='bar', stacked=True, ax=axes[0, 1], width=0.8, cmap='tab10')
axes[0, 1].set_title('Top 5 Subcategories by Sales over the Years')
axes[0, 1].set_ylabel('Total Sales (USD)')

# -----------------------------------------------------------------------------------------------------------------------------
# 3. Top 5 Subcategories by Total Profit over the Years

# Identify the top 5 subcategories with the highest total profit
top_5_subcat_profit = df_subcat.nlargest(5, 'total_profit')

# Filter yearly data for the selected top 5 subcategories
top_5_subcat_profit_year = df_subcat_year[df_subcat_year['sub_category'].isin(top_5_subcat_profit['sub_category'])]

# Pivot data to format it for stacked bar plotting
subcat_profit_year_pivot = top_5_subcat_profit_year.pivot(index='year', columns='sub_category', values='total_profit')

# Plot a stacked bar chart to show yearly total profit trends
subcat_profit_year_pivot.plot(kind='bar', stacked=True, ax=axes[1, 0], width=0.8, colormap='tab10')
axes[1, 0].set_title('Top 5 Subcategories by Profit over the Years')
axes[1, 0].set_ylabel('Total Profit (USD)')

# -----------------------------------------------------------------------------------------------------------------------------
# 4. Top 5 Subcategories by Profit Margin over the Years

# Identify the top 5 subcategories with the highest profit margin
top_5_subcat_margin = df_subcat.nlargest(5, 'profit_margin')

# Filter yearly data for the selected top 5 subcategories
top_5_subcat_margin_year = df_subcat_year[df_subcat_year['sub_category'].isin(top_5_subcat_margin['sub_category'])]

# Plot a grouped bar chart for profit margin trends across years
sns.barplot(x='year', y='profit_margin', data=top_5_subcat_margin_year, hue='sub_category',
    palette='tab10', ax=axes[1, 1], errorbar=None)

# Overlay a line plot showing the overall profit margin trend
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, ax=axes[1, 1], color='green', label='Overall Profit Margin')

# Set plot title and labels
axes[1, 1].set_title('Top 5 Subcategories by Profit Margin over the Years')
axes[1, 1].set_ylabel('Profit Margin (%)')

# -----------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.tick_params(axis='x', rotation=45)  
    ax.set_xlabel('')  
    ax.legend(title="Subcategory", loc='best')

# -----------------------------------------------------------------------------------------------------------------------------
# Add Data Labels to Bars 
for ax in axes.flat:
    if ax != axes[1, 1]:  # Exclude the profit margin plot (axes[1, 1])
        for container in ax.containers:
            ax.bar_label(container, fmt='%.0f', color='black', fontweight='bold', fontsize=14, label_type='center')
# -----------------------------------------------------------------------------------------------------------------------------
# Adjust Layout
plt.tight_layout(pad=3)

# Display the Plots
plt.show()
[Figure: Top 5 subcategories by quantity sold, total sales, total profit, and profit margin over the years]

Some Insights:¶

1. Top 5 Subcategories by Total Quantity Sold over the Years:

  • The top 5 subcategories in total quantity sold are Binders, Paper, Furnishings, Phones, and Storage.
  • All these subcategories experienced steady growth each year, except for Storage, which saw a slight decline in 2022 before rebounding.
  • Binders and Paper consistently led in quantity sold each year, maintaining their position as the top two contributors.
  • From 2022 onward, Furnishings, Phones, and Storage followed in that order, while in 2021, Phones ranked slightly higher than Furnishings. The overall trend indicates increasing demand across all major subcategories, reaching peak levels in 2024.

2. Top 5 Subcategories by Total Sales over the Years:

  • The top five subcategories in total sales are Phones, Chairs, Storage, Tables, and Binders.
  • After a decline in 2022, total sales rebounded across all these subcategories, showing consistent growth and peaking in 2024.
  • Phones and Chairs were consistently the top two contributors to total sales. Phones ranked first in 2021 and 2024, while Chairs took the lead in 2022 and 2023.
  • Storage consistently held third place in total sales, except in 2023, when Tables (which is not among the top 5 in quantity sold) outperformed it.
  • Binders, despite leading in quantity sold, ranked fifth in total sales from 2021 to 2023 but moved up to third place in 2024.

3. Top 5 Subcategories by Total Profit over the Years:

  • The top 5 subcategories in total profit are Phones, Accessories, Paper, Copiers, and Binders.
  • Copiers and Paper demonstrated consistent profit growth, peaking in 2024. Copiers saw an exceptional surge from USD 2,913 in 2021 to USD 25,032 in 2024, representing a 759% increase.
  • Phones, despite an upward trend in sales, experienced a profit decline from 2021 to 2023. However, they rebounded strongly in 2024, regaining their position as a key profit driver.
  • Accessories showed an increasing trend over time, while Binders peaked in 2023 before experiencing a slight decline in 2024.
  • Shifts in profit contribution were evident across the years. In 2021 and 2022, Accessories (despite not being among the top five best-selling subcategories) and Phones were the dominant profit drivers, whereas in 2023 and 2024, Copiers significantly outperformed all other subcategories.
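Growth figures such as the Copiers surge can be reproduced with a year-over-year percentage change. A minimal sketch, where the 2021 and 2024 Copiers profits come from the analysis above and the intermediate years are placeholder values:

```python
import pandas as pd

# Yearly Copiers profit (2021 and 2024 from the text; 2022-2023 are placeholders)
copiers = pd.Series([2913.0, 8000.0, 17000.0, 25032.0],
                    index=[2021, 2022, 2023, 2024], name='total_profit')

# Cumulative growth from the first to the last year, in percent
growth_pct = 100 * (copiers.iloc[-1] - copiers.iloc[0]) / copiers.iloc[0]
print(f"2021 -> 2024 profit growth: {growth_pct:.0f}%")   # ~759%

# Year-over-year percentage change
print(copiers.pct_change().mul(100).round(1))
```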

4. Top 5 Subcategories by Profit Margin over the Years:

  • The top 5 subcategories in terms of Profit Margin are Paper, Copiers, Envelopes, Labels, and Fasteners.
  • Notably, Envelopes, Labels, and Fasteners are neither among the top five best-selling subcategories nor the top five in total profit, which highlights their high profitability relative to sales volume.
  • Labels, Paper, and Envelopes consistently hold the top three positions in Profit Margin across the years, reflecting strong unit-level profitability despite lower total sales.
  • Copiers and Fasteners follow, also maintaining solid margins.
3.3.4.1.3. Product Analysis¶
# a) Aggregate metrics at the product level for the entire period
df_prod = df.groupby('product_name').agg(
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'),
    total_quantity=('quantity', 'sum')
).reset_index()

# Calculate profit margin (%)
df_prod['profit_margin'] = (df_prod['total_profit'] / df_prod['total_sales']) * 100

# Identify top and least performing products
most_profitable_products = df_prod.sort_values(by='total_profit', ascending=False).head(10)
least_profitable_products = df_prod.sort_values(by='total_profit', ascending=True).head(10)

# Display results
print("Most Profitable Products:")
print(most_profitable_products.round(2))

print("\nLeast Profitable Products:")
print(least_profitable_products.round(2))
Most Profitable Products:
                                           product_name  total_sales  \
404               Canon imageCLASS 2200 Advanced Copier     61599.82   
649   Fellowes PB500 Electric Punch Plastic Comb Bin...     27453.38   
804                Hewlett Packard LaserJet 3310 Copier     18839.69   
400                  Canon PC1060 Personal Laser Copier     11619.83   
786   HP Designjet T520 Inkjet Large Format Printer ...     18374.90   
165                   Ativa V4110MDD Micro-Cut Shredder      7699.89   
19     3D Systems Cube Printer, 2nd Generation, Magenta     14299.89   
1275  Plantronics Savi W720 Multi-Device Wireless He...      9367.29   
894                Ibico EPK-21 Electric Binding System     15875.92   
1839                  Zebra ZM400 Thermal Label Printer      6965.70   

      total_profit  total_quantity  profit_margin  
404       25199.93              20          40.91  
649        7753.04              31          28.24  
804        6983.88              38          37.07  
400        4570.93              19          39.34  
786        4094.98              12          22.29  
165        3772.95              11          49.00  
19         3717.97              11          26.00  
1275       3696.28              24          39.46  
894        3345.28              13          21.07  
1839       3343.54               6          48.00  

Least Profitable Products:
                                           product_name  total_sales  \
475           Cubify CubeX 3D Printer Double Head Print     11099.96   
984           Lexmark MX611dhe Monochrome Laser Printer     16829.90   
476           Cubify CubeX 3D Printer Triple Head Print      7999.98   
425   Chromcraft Bull-Nose Wood Oval Conference Tabl...      9917.64   
376   Bush Advantage Collection Racetrack Conference...      9544.72   
682           GBC DocuBind P400 Electric Binding System     17965.07   
444   Cisco TelePresence System EX90 Videoconferenci...     22638.48   
1042  Martin Yale Chadless Opener Electric Letter Op...     16656.20   
285                        Balt Solid Wood Round Tables      6518.75   
364   BoxOffice By Design Rectangular and Half-Moon ...      1706.25   

      total_profit  total_quantity  profit_margin  
475       -8879.97               9         -80.00  
984       -4589.97              18         -27.27  
476       -3839.99               4         -48.00  
425       -2876.12              27         -29.00  
376       -1934.40              33         -20.27  
682       -1878.17              27         -10.45  
444       -1811.08               6          -8.00  
1042      -1299.18              22          -7.80  
285       -1201.06              19         -18.42  
364       -1148.44              15         -67.31  
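The loss-makers listed above can be flagged programmatically by filtering the product-level aggregate on negative total profit. A minimal sketch on a toy `df_prod`-style table (generic product names; the values loosely echo the output above):

```python
import pandas as pd

# Toy product-level aggregates (illustrative rows, generic names)
df_prod_toy = pd.DataFrame({
    'product_name': ['Copier A', '3D Printer B', 'Table C', 'Binder D'],
    'total_sales':  [61599.82, 11099.96, 9917.64, 27453.38],
    'total_profit': [25199.93, -8879.97, -2876.12, 7753.04],
})

# Products operating at a loss, worst first
losses = (df_prod_toy[df_prod_toy['total_profit'] < 0]
          .sort_values('total_profit'))
print(f"{len(losses)} loss-making products, total loss USD {losses['total_profit'].sum():,.2f}")
print(losses[['product_name', 'total_profit']])
```

On the real `df_prod`, the same filter surfaces every product operating at a loss, not just the bottom ten.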
# b) Aggregate metrics at the product level for each year
df_prod_year = df.groupby(['year', 'product_name']).agg(
    total_quantity=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_prod_year['sales_share'] = (df_prod_year['total_sales'] / df_prod_year['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_prod_year['profit_share'] = (df_prod_year['total_profit'] / df_prod_year['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_prod_year['profit_margin'] = (df_prod_year['total_profit'] / df_prod_year['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_prod_year = df_prod_year.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_prod_year.round(2).head()
year product_name total_quantity total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 2024 Canon imageCLASS 2200 Advanced Copier 11 35699.90 15679.96 6.67 3 1.55 5.47 43.92
1 2023 Canon imageCLASS 2200 Advanced Copier 9 25899.93 9519.97 20.00 2 1.13 3.32 36.76
2 2021 Cisco TelePresence System EX90 Videoconferenci... 6 22638.48 -1811.08 50.00 1 0.99 -0.63 -8.00
3 2022 Fellowes PB500 Electric Punch Plastic Comb Bin... 13 15760.28 7498.84 5.00 4 0.69 2.62 47.58
4 2023 GBC Ibimaster 500 Manual ProClick Binding System 26 12860.56 2967.82 52.50 4 0.56 1.04 23.08

3.3.4.2. Customer Segment Analysis¶

Objective: Evaluate customer segments to understand their contributions to overall revenue and profitability.

This analysis will be conducted in two steps: data aggregation and data visualization to systematically derive insights from the data.

  • 1) Segment Analysis over the Entire Period¶
# a) Aggregate metrics by Segment for the entire period

df_seg = df.groupby('segment').agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_seg['sales_share'] = (df_seg['total_sales'] / df_seg['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_seg['profit_share'] = (df_seg['total_profit'] / df_seg['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_seg['profit_margin'] = (df_seg['total_profit'] / df_seg['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_seg = df_seg.sort_values(by='total_sales', ascending=False, ignore_index=True)
 
# Display the result
df_seg.round(2).head()
segment total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 Consumer 19521 1161401.34 134119.21 15.81 409 50.56 46.83 11.55
1 Corporate 11608 706146.37 91979.13 15.82 236 30.74 32.12 13.03
2 Home Office 6744 429653.15 60298.68 14.71 148 18.70 21.05 14.03
# b) Visualizing Key Metrics by Segment over the Entire Period

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a 1x2 grid of subplots 
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

# Set background color for the entire figure
fig.patch.set_facecolor('lightgrey')

#-------------------------------------------------------------------------------------------------------------------------------
### 1. Total Sales, Total Profit & Total Quantity by Segment

# Create side-by-side bar plots for Total Sales and Profit
melted_seg = df_seg.melt(id_vars='segment', value_vars=['total_sales', 'total_profit'], var_name='Metric', value_name='Value')
melted_seg['Metric'] = melted_seg['Metric'].replace({'total_sales': 'Total Sales','total_profit': 'Total Profit'})
sns.barplot( x='segment',  y='Value', hue='Metric', data=melted_seg, palette=['blue', 'orange'], ax=axes[0], alpha=0.85)

# Create line plot for Total Quantity Sold
ax2_secondary = axes[0].twinx()  
sns.lineplot(x='segment', y='total_quantity_sold', data=df_seg, color='red', marker='^', linestyle='--', ax=ax2_secondary, label='Total Quantity Sold')

# Customizations
axes[0].set_title('Total Quantity Sold, Sales and Profit by Segment')
axes[0].set_ylabel('Amount (USD)')
ax2_secondary.set_ylabel('Quantity Sold (Units)')  
ax2_secondary.grid(False)  
axes[0].legend(title=None,loc='upper center')
ax2_secondary.legend(loc='upper center',bbox_to_anchor=(0.5, 0.94))
#------------------------------------------------------------------------------------------------------------------------------

### 2. Profit Margin vs. Average Discount by Segment

# Create a bar plot for Profit Margin and a line plot for Discount
sns.barplot(data=df_seg, x='segment', y='profit_margin', hue='segment', palette='coolwarm', ax=axes[1], width=0.8, legend=False)
sns.lineplot(data=df_seg, x='segment', y='average_discount', ax=axes[1], color='green', marker='o', linewidth=2, linestyle='--', label='Average Discount')
# Customizations
axes[1].set_title('Profit Margin vs. Average Discount by Segment')
axes[1].set_ylabel('Percentage (%)')
axes[1].legend(loc='upper right')
#------------------------------------------------------------------------------------------------------------------------------      
# Apply Common Customizations to all Subplots
for ax in axes.flat:  
    ax.tick_params(axis='x', rotation=45)
    ax.tick_params(axis='y')        
    ax.set_xlabel('')  
#------------------------------------------------------------------------------------------------------------------------------
# Add Data Labels for Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax==axes[1] else '%.0f'
    for container in ax.containers:  
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='edge') 
#------------------------------------------------------------------------------------------------------------------------------
# Adjust layout 
plt.tight_layout(pad=3)

# Display the plots
plt.show()    
[Figure: Total quantity sold, sales, and profit by segment; profit margin vs. average discount by segment]

Some Insights:¶

Consumer:

  • This segment has the highest total quantity sold (19,521 units), total sales (USD 1,161,401.34), and total profit (USD 134,119.21). This suggests that high sales volume is the key driver of revenue and profit in this segment.
  • It contributes the highest share of total profit (46.83%) while accounting for 50.56% of total sales, making it the largest revenue contributor.
  • The discount level (15.81%) is relatively high, which may be eroding profitability despite strong revenue generation.
  • The profit margin (11.55%) is the lowest among the three segments, indicating that although revenue is high, profit per dollar of sales is lower due to discounts or cost structures.

Corporate:

  • This segment achieves moderate total sales (USD 706,146.37) and profit (USD 91,979.13) while selling 11,608 units, indicating balanced sales volume and profitability.
  • It represents 30.74% of total sales but contributes 32.12% of total profit, showing a relatively strong balance between sales and profit contribution.
  • The discount level (15.82%) is nearly the same as Consumer, but it maintains a higher profit margin (13.03%), suggesting that it benefits from better cost control or higher-margin products compared to Consumer.

Home Office:

  • This segment has the lowest total quantity sold (6,744 units), total sales (USD 429,653.15), and total profit (USD 60,298.68).
  • It contributes only 18.70% of total sales but 21.05% of total profit, making it the most profitable segment relative to sales.
  • Its profit margin is the highest (14.03%), and it receives slightly lower discounts (14.71%) than the other two segments. This indicates that a lower discount strategy contributes to its strong profitability.

Discount-Profitability Relationship:

The data tends to suggest a negative correlation between discount levels and profitability:

  • Consumer: High Discount (15.81%) → Lowest Profit Margin (11.55%).
  • Corporate: High Discount (15.82%) → Moderate Profit Margin (13.03%).
  • Home Office: Lowest Discount (14.71%) → Highest Profit Margin (14.03%).

This reinforces the importance of discount optimization in maintaining profitability, especially for segments with high sales volume.
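The leverage in discount optimization can be illustrated with a naive what-if: apply Home Office's lower average discount to the Consumer segment and add the recovered discount back to profit. This sketch uses the segment figures reported above and deliberately ignores any volume response to the price change:

```python
# Consumer segment aggregates from the table above
consumer_sales = 1_161_401.34    # total sales (USD)
consumer_profit = 134_119.21     # total profit (USD)
consumer_disc = 0.1581           # current average discount
target_disc = 0.1471             # Home Office's average discount

# Approximate list-price revenue, then re-apply the lower discount
list_revenue = consumer_sales / (1 - consumer_disc)
new_sales = list_revenue * (1 - target_disc)
new_profit = consumer_profit + (new_sales - consumer_sales)

print(f"Consumer margin: {100 * consumer_profit / consumer_sales:.2f}% "
      f"-> {100 * new_profit / new_sales:.2f}% under the lower discount")
```

Under these simplifying assumptions the Consumer margin moves from roughly 11.5% to about 12.7%, which is one way to size the opportunity before modeling demand effects.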

  • 2) Segment Analysis over the Years¶
# a) Aggregate metrics at the Segment level for each year
df_seg_year = df.groupby(['year', 'segment']).agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_seg_year['sales_share'] = (df_seg_year['total_sales'] / df_seg_year['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_seg_year['profit_share'] = (df_seg_year['total_profit'] / df_seg_year['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_seg_year['profit_margin'] = (df_seg_year['total_profit'] / df_seg_year['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_seg_year = df_seg_year.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_seg_year.round(2).head()
year segment total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 2024 Consumer 6282 331904.70 45568.24 15.83 361 14.45 15.91 13.73
1 2023 Consumer 4914 296863.90 35770.95 14.60 331 12.92 12.49 12.05
2 2022 Consumer 4272 266535.93 28460.17 15.76 295 11.60 9.94 10.68
3 2021 Consumer 4053 266096.81 24319.85 17.34 311 11.58 8.49 9.14
4 2024 Corporate 3758 241847.82 26782.36 15.78 204 10.53 9.35 11.07
# b) Visualizing Key Metrics by Segment over the Years
# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a figure with a 3x2 grid of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(22, 20))

# Set the background color for the figure area
fig.patch.set_facecolor('lightgrey')
# ------------------------------------------------------------------------------------------------------------------------------
# Preparing Data for stacked plots: reshaping the data  using single pivot_table() 
df_seg_year_pivot = df_seg_year.pivot_table(index='year', columns='segment', 
    values=['total_quantity_sold', 'total_sales', 'total_profit', 'number_of_customers'], 
    aggfunc='sum',fill_value=0)

# ------------------------------------------------------------------------------------------------------------------------------
# 1. Total Quantity by Segment over the Years
df_seg_year_pivot['total_quantity_sold'].plot(kind='bar', stacked=True, ax=axes[0, 0], cmap='crest', width=0.8)
axes[0, 0].set_title('Total Quantity Sold by Segment over the Years')
axes[0, 0].set_ylabel('Quantity Sold (Units)')
# ------------------------------------------------------------------------------------------------------------------------------
# 2. Total Sales by Segment over the Years
df_seg_year_pivot['total_sales'].plot(kind='bar', stacked=True, ax=axes[0, 1], cmap='crest', width=0.8)
axes[0, 1].set_title('Total Sales by Segment over the Years')
axes[0, 1].set_ylabel('Total Sales (USD)')
# ------------------------------------------------------------------------------------------------------------------------------
# 3. Total Profit by Segment over the Years
df_seg_year_pivot['total_profit'].plot(kind='bar', stacked=True, ax=axes[1, 0], cmap='crest', width=0.8)
axes[1, 0].set_title('Total Profit by Segment over the Years')
axes[1, 0].set_ylabel('Total Profit (USD)')
# ------------------------------------------------------------------------------------------------------------------------------
# 4. Number of Customers by Segment over the Years
df_seg_year_pivot['number_of_customers'].plot(kind='bar', stacked=True, ax=axes[1, 1], cmap='crest', width=0.8)
axes[1, 1].set_title('Number of Customers by Segment over the Years')  
axes[1, 1].set_ylabel('Number of Customers')
# ------------------------------------------------------------------------------------------------------------------------------
# 5. Profit Margin by Segment over the Years
sns.barplot(x='year', y='profit_margin', data=df_seg_year, hue='segment', palette='crest', ax=axes[2, 0], errorbar=None)
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, ax=axes[2, 0], color='green', label='Overall Profit Margin')  
axes[2, 0].set_title('Profit Margin by Segment over the Years')
axes[2, 0].set_ylabel('Profit Margin (%)')
# ------------------------------------------------------------------------------------------------------------------------------
# 6. Average Discount by Segment over the Years
sns.barplot(x='year', y='average_discount', data=df_seg_year, hue='segment', palette='crest', ax=axes[2, 1], errorbar=None)
sns.lineplot(x='year', y='average_discount', data=yearly_aggr, ax=axes[2, 1], color='orange', label='Overall Discount')
axes[2, 1].set_title('Average Discount by Segment over the Years')
axes[2, 1].set_ylabel('Discount (%)')
#----------------------------------------------------------------------------------------------------------------------------    
# Apply Common Customizations to all Subplots
for ax in axes.flat:   
    ax.grid(True, linestyle='--', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)       
    ax.set_xlabel('')
    ax.legend(loc='best')  
#---------------------------------------------------------------------------------------------------------------------------
for ax in axes.flat:  
    fmt = '%.2f' if ax in [axes[2, 0], axes[2, 1]] else '%.0f'
    for container in ax.containers:  # Loop through bar containers
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=14, label_type='center')      
#-------------------------------------------------------------------------------------------------------------------------
# Adjust layout 
plt.tight_layout(pad=3)

# Display the plots
plt.show()    
[Figure: stacked and grouped bar charts of total quantity sold, total sales, total profit, number of customers, profit margin, and average discount by segment over the years]

Some Insights:¶

1. Total Quantity by Segment Over the Years

  • Total quantity sold has steadily increased over time across all segments, reaching its highest level in 2024.
  • The Consumer segment consistently dominates in total quantity sold, holding the largest share each year.
  • The Corporate segment follows as the second-largest in terms of units sold, while Home Office consistently has the lowest quantity sold. The overall growth in total quantity suggests increasing demand, but the disparity in sales volume across segments highlights differences in purchasing patterns.

2. Total Sales by Segment Over the Years:

  • Total sales have generally increased over time across all segments, except in 2022, when a slight decline in the Home Office segment compared to 2021 contributed to a drop in overall yearly sales for that year.
  • The Consumer segment remains the dominant driver of total sales, maintaining the highest revenue share each year.
  • The Corporate segment ranks second in sales, while Home Office consistently has the lowest sales. Given its leading role in both sales volume and revenue, the Consumer segment plays a critical role in overall business performance.

3. Total Profit by Segment Over the Years:

  • Total profit has consistently increased over time across all segments, except for 2024, where a slight decline is observed in the Corporate segment compared to 2023. However, this decline is offset by profit growth in the Consumer and Home Office segments, ensuring that total yearly profit did not decrease in 2024.
  • The Consumer segment dominates in total profit, followed by Corporate and then Home Office.
  • The Home Office segment, while consistently having the lowest profit, has shown relative improvements in recent years.

4. Number of Customers by Segment Over the Years:

  • The total number of customers has grown over time, despite a very slight decline in 2022 for the Consumer and Corporate segments.
  • The Consumer segment has the largest customer base each year, reinforcing its dominant position in terms of both sales and total quantity sold. The Corporate segment follows, while the Home Office segment has the smallest customer base.

5. Average Profit Margin by Segment Over the Years:

  • The Consumer segment shows a consistent increase in profit margin over time, rising from 9.14% in 2021 to 13.73% in 2024.
  • The Corporate and Home Office segments saw an increase in 2022 compared to 2021 but have since experienced a gradual decline in profit margin.
  • The Home Office segment led in 2021 and 2022, but ranked second in 2023 and 2024.
  • Notably, while Consumer leads in sales and profit, it had the lowest profit margin from 2021 to 2023, only surpassing Home Office in 2024.

6. Average Discount by Segment Over the Years:

  • The Consumer and Corporate segments generally receive the highest average discounts, except in 2023, when Home Office surpassed the other segments.

3.3.4.3. Geographical Analysis¶

Objective: Discover geographical performance differences at both region and state levels.

This analysis will be conducted in two steps: data aggregation and data visualization to systematically derive insights from the data.

3.3.4.3.1. Regional Analysis¶

  • 1) Regional Analysis over the Entire Period¶
# a) Aggregate metrics at the region level for the entire period

df_reg = df.groupby('region').agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_reg['sales_share'] = (df_reg['total_sales'] / df_reg['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_reg['profit_share'] = (df_reg['total_profit'] / df_reg['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_reg['profit_margin'] = (df_reg['total_profit'] / df_reg['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_reg = df_reg.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_reg.round(2).head()
region total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 West 12266 725457.82 108418.45 10.93 686 31.58 37.86 14.94
1 East 10618 678781.24 91522.78 14.54 674 29.55 31.96 13.48
2 Central 8780 501239.89 39706.36 24.04 629 21.82 13.86 7.92
3 South 6209 391721.90 46749.43 14.73 512 17.05 16.32 11.93
### b) Visualizing Key Metrics by Region over the Entire Period

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a 1x2 grid of subplots 
fig, axes = plt.subplots(1, 2, figsize=(22, 8))

# Set background color for the entire figure
fig.patch.set_facecolor('lightgrey')
#-------------------------------------------------------------------------------------------------------------------------------
### 1. Total Sales, Total Profit & Total Quantity by Region

# Create side-by-side bar plots for Total Sales and Profit
melted_reg = df_reg.melt(id_vars='region', value_vars=['total_sales', 'total_profit'], var_name='Metric', value_name='Value')
melted_reg['Metric'] = melted_reg['Metric'].replace({'total_sales': 'Total Sales','total_profit': 'Total Profit'})
sns.barplot( x='region',  y='Value', hue='Metric', data=melted_reg, palette=['blue', 'orange'], ax=axes[0], alpha=0.85)

# Create a line plot for Total Quantity on a Secondary axis
ax2_secondary = axes[0].twinx()  
sns.lineplot(x='region', y='total_quantity_sold', data=df_reg, color='red', marker='^', linestyle='--',  ax=ax2_secondary, label='Total Quantity Sold')

# Customizations
axes[0].set_title('Total Quantity Sold, Sales and Profit by Region')
axes[0].set_ylabel('Amount (USD)')
ax2_secondary.set_ylabel('Quantity Sold (Units)')  
ax2_secondary.grid(False)  
axes[0].legend(title=None,loc='upper center')
ax2_secondary.legend(loc='upper center',bbox_to_anchor=(0.5, 0.92))
#------------------------------------------------------------------------------------------------------------------------------
### 2. Profit Margin vs. Average Discount by Region
sns.barplot(data=df_reg, x='region', y='profit_margin', palette='coolwarm',hue='region', ax=axes[1], width=0.8,legend=False)
sns.lineplot(data=df_reg, x='region', y='average_discount',ax=axes[1], color='green', marker='o',linewidth=2, linestyle='--', label='Average Discount')
axes[1].set_title('Profit Margin vs. Average Discount by Region')
axes[1].set_ylabel('Percentage (%)')
axes[1].legend(loc='best')
#-------------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:   
    ax.grid(True, linestyle='--', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)       
    ax.set_xlabel('')
#---------------------------------------------------------------------------------------------------------------------------
# Add Data Labels for Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax==axes[1] else '%.0f'
    for container in ax.containers:  
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='edge') 
#------------------------------------------------------------------------------------------------------------------------------
# Adjust layout 
plt.tight_layout(pad=3)

# Display the plots
plt.show()    
[Figure: bar and line charts of total quantity sold, sales, and profit by region, and profit margin vs. average discount by region]

Some Insights:¶

West:

  • The West region consists of the following states: California, Oregon, Arizona, Nevada, Washington, Colorado, Utah, New Mexico, Idaho, Montana, and Wyoming.
  • This region has the highest total quantity sold (12,266 units), total sales (USD 725,457.82), and total profit (USD 108,418.45), making it the strongest performer. The biggest contributing states to sales are: California and Washington.
  • It contributes the highest share of total profit (37.86%) while accounting for 31.58% of total sales, highlighting its strong revenue generation.
  • The profit margin (14.94%) is the highest among all regions, suggesting efficient pricing and cost management.
  • The discount level (10.93%) is the lowest, reinforcing the idea that controlled discounting supports profitability.

East:

  • The East region consists of the following states: Pennsylvania, Delaware, Ohio, New York, New Jersey, Massachusetts, Maryland, Connecticut, New Hampshire, Maine, Rhode Island, District of Columbia, Vermont, and West Virginia.
  • This region follows closely with total sales of USD 678,781.24 and profit of USD 91,522.78, with 10,618 units sold. The biggest contributors to sales are New York, Pennsylvania, and Ohio.
  • It contributes 29.55% of total sales and 31.96% of total profit, indicating a solid balance between revenue and profitability.
  • It has a profit margin (13.48%) slightly lower than the West.
  • Discounts are significantly higher at 14.54%, which may be slightly impacting profitability.

Central:

  • The Central region consists of the following states: Texas, Illinois, Michigan, Indiana, South Dakota, Wisconsin, Missouri, Minnesota, Iowa, Oklahoma, Nebraska, Kansas, and North Dakota.
  • This region has moderate sales of USD 501,239.89 and the lowest profit level (USD 39,706.36) among all regions. The major contributing states to sales are Texas, Illinois, and Michigan.
  • It represents 21.82% of total sales but only 13.86% of total profit, indicating weaker profitability.
  • The profit margin (7.92%) is the lowest among all regions, likely due to the highest discount rate (24.04%). The data suggests that excessive discounting is eroding profit margins in this region.

South:

  • The South region consists of the following states: Kentucky, Georgia, Virginia, South Carolina, Louisiana, Arkansas, Tennessee, Florida, North Carolina, Mississippi, and Alabama.
  • This region has the lowest total quantity sold (6,209 units) and total sales (USD 391,721.90), but achieves a higher profit (USD 46,749.43) than the Central region despite selling fewer units. The main contributors to sales are Florida and North Carolina.
  • It accounts for 17.05% of total sales and 16.32% of total profit, showing a relatively balanced but lower contribution.
  • The profit margin (11.93%) is better than Central but still lower than West and East. The discount rate (14.73%) is moderate but needs careful optimization to maintain profitability.

Note: The states with the highest sales are not always the ones with the highest profit and profit margin. For more details, see the section below on State Analysis.

Discount-Profitability Relationship:

The data appears to indicate a negative correlation between discount levels and profitability across regions:

  • West: Low Discount (10.93%) → Highest Profit Margin (14.94%).
  • East: Moderate Discount (14.54%) → Moderate Profit Margin (13.48%).
  • Central: High Discount (24.04%) → Lowest Profit Margin (7.92%).
  • South: Moderate Discount (14.73%) → Mid-Low Profit Margin (11.93%).

This suggests that reducing discounts can significantly improve profitability, particularly in the Central region, where excessive discounts are cutting into profits.
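Region-level averages can hide how steeply margins fall as discounts deepen. A minimal order-level sketch, binning discounts into bands and computing the margin within each band (the synthetic rows below stand in for the transaction DataFrame used throughout this notebook, which has `discount`, `sales`, and `profit` columns):

```python
import pandas as pd

# Synthetic stand-in for the transaction-level DataFrame used in this notebook
orders = pd.DataFrame({
    'discount': [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7],
    'sales':    [200, 150, 180, 220, 160, 140, 120, 100],
    'profit':   [ 60,  40,  35,  25,  -5, -20, -35, -60],
})

# Bin orders by discount level and compute the profit margin within each band
orders['discount_band'] = pd.cut(orders['discount'],
                                 bins=[-0.01, 0.0, 0.2, 0.4, 1.0],
                                 labels=['0%', '1-20%', '21-40%', '41-100%'])
band = orders.groupby('discount_band', observed=True).agg(
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'))
band['profit_margin'] = band['total_profit'] / band['total_sales'] * 100
print(band.round(2))
```

Applied to the real data, a table like this shows the discount level at which orders tip from profitable to loss-making, which is more actionable than a single average per region.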
  • 2) Regional Analysis over the Years¶
# a) Aggregate metrics at the Region level for each year
df_reg_year = df.groupby(['year', 'region']).agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_reg_year['sales_share (%)'] = (df_reg_year['total_sales'] / df_reg_year['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_reg_year['profit_share (%)'] = (df_reg_year['total_profit'] / df_reg_year['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_reg_year['profit_margin'] = (df_reg_year['total_profit'] / df_reg_year['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_reg_year = df_reg_year.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_reg_year.round(2).head()
year region total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share (%) profit_share (%) profit_margin
0 2024 West 4270 250128.37 43808.96 10.52 397 10.89 15.30 17.51
1 2024 East 3411 213082.90 33230.56 14.78 354 9.28 11.60 15.60
2 2023 West 3025 187480.18 24051.61 11.72 329 8.16 8.40 12.83
3 2023 East 2839 180685.82 20141.60 14.28 303 7.87 7.03 11.15
4 2022 East 2382 156332.06 21091.01 13.48 246 6.81 7.36 13.49
# b) Visualizing Key Metrics by Region over the Years

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a figure with a 3x2 grid of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(22, 20))

# Set the background color for the figure area
fig.patch.set_facecolor('lightgrey')
# ------------------------------------------------------------------------------------------------------------------------------

# Preparing Data for stacked plots: reshaping the data  using single pivot_table() 
df_reg_year_pivot = df_reg_year.pivot_table(index='year', columns='region', 
    values=['total_quantity_sold', 'total_sales', 'total_profit', 'number_of_customers'], 
    aggfunc='sum',fill_value=0)

# ------------------------------------------------------------------------------------------------------------------------------
# 1. Total Quantity by Region over the Years
df_reg_year_pivot['total_quantity_sold'].plot(kind='bar', stacked=True, ax=axes[0, 0], cmap='Set3', width=0.8)
axes[0, 0].set_title('Total Quantity Sold by Region over the Years')
axes[0, 0].set_ylabel('Quantity Sold (Units)')
# ------------------------------------------------------------------------------------------------------------------------------
# 2. Total Sales by Region over the Years
df_reg_year_pivot['total_sales'].plot(kind='bar', stacked=True, ax=axes[0, 1], cmap='Set3', width=0.8)
axes[0, 1].set_title('Total Sales by Region over the Years')
axes[0, 1].set_ylabel('Total Sales (USD)')
# ------------------------------------------------------------------------------------------------------------------------------
# 3. Total Profit by Region over the Years
df_reg_year_pivot['total_profit'].plot(kind='bar', stacked=True, ax=axes[1, 0], cmap='Set3', width=0.8)
axes[1, 0].set_title('Total Profit by Region over the Years')
axes[1, 0].set_ylabel('Total Profit (USD)')
# ------------------------------------------------------------------------------------------------------------------------------
# 4. Number of Customers by Region over the Years
df_reg_year_pivot['number_of_customers'].plot(kind='bar', stacked=True, ax=axes[1, 1], cmap='Set3', width=0.8)
axes[1, 1].set_title('Number of Customers by Region over the Years')  
axes[1, 1].set_ylabel('Number of Customers')
# ------------------------------------------------------------------------------------------------------------------------------
# 5. Profit Margin by Region over the Years
sns.barplot(x='year', y='profit_margin', data=df_reg_year, hue='region', palette='Set3', ax=axes[2, 0], errorbar=None)
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, ax=axes[2, 0], color='green', label='Overall Profit Margin')  
axes[2, 0].set_title('Profit Margin by Region over the Years')
axes[2, 0].set_ylabel('Profit Margin (%)')
# ------------------------------------------------------------------------------------------------------------------------------
# 6. Average Discount by Region over the Years
sns.barplot(x='year', y='average_discount', data=df_reg_year, hue='region', palette='Set3', ax=axes[2, 1], errorbar=None)
sns.lineplot(x='year', y='average_discount', data=yearly_aggr, ax=axes[2, 1], color='orange', label='Overall Discount')
axes[2, 1].set_title('Average Discount by Region over the Years')
axes[2, 1].set_ylabel('Discount (%)')
#----------------------------------------------------------------------------------------------------------------------------    
# Apply Common Customizations to all Subplots
for ax in axes.flat:   
    ax.grid(True, linestyle='--', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)       
    ax.set_xlabel('')
    ax.legend(loc='best')  
#---------------------------------------------------------------------------------------------------------------------------
for ax in axes.flat:
    if ax not in [axes[2, 0], axes[2, 1]]: 
        for container in ax.containers:
            ax.bar_label(container, fmt='%.0f', color='black', fontweight='bold', fontsize=12, label_type='center')        
        
#-------------------------------------------------------------------------------------------------------------------------
# Adjust layout 
plt.tight_layout(pad=3)

# Display the plots
plt.show()    
[Figure: stacked and grouped bar charts of total quantity sold, total sales, total profit, number of customers, profit margin, and average discount by region over the years]

Some Insights:¶

1. Total Quantity by Region Over the Years:

  • Overall, total quantity sold has increased across all regions, reaching its highest level in 2024. However, the West region experienced a slight dip in 2022 before rebounding in subsequent years.
  • The West region consistently leads in total quantity sold, contributing the highest volume each year.
  • The East region follows as the second-highest contributor, with the Central region in third and the South region consistently having the lowest sales volume. The steady increase in total quantity suggests growing market penetration across all regions.

2. Total Sales by Region Over the Years:

  • Apart from the East region, which exhibited consistent growth over the years, all other regions experienced a decline in total sales in 2022 before rebounding in the following years. Note, however, that the Central region's total profit declined again in 2024 despite an increase in its total sales.
  • The West region continues to consistently lead in total sales, except in 2022, when it was overtaken by the East region, which generally ranks as the second-highest contributor.
  • The South region remains the lowest-performing in sales, trailing behind the Central region in most years, except for 2021.

3. Total Profit by Region Over the Years:

  • The West region exhibits steady profit growth over time, while the other regions experience fluctuations with periods of decline and recovery. The East region saw profit growth in the first two years, dipped in 2023, and rebounded in 2024. Meanwhile, the Central region maintained consistent profit growth for three years before declining in 2024.
  • The West region is generally the highest profit contributor, except in 2022 when the East region briefly overtook it. The Central and South regions consistently report lower profit levels, with more pronounced fluctuations.

4. Number of Customers by Region Over the Years:

  • The total number of customers has generally increased across all regions, reaching its highest level in 2024. However, the South and Central regions saw a slight decline in 2022 compared to 2021 before recovering in subsequent years.
  • The West region consistently maintains the largest customer base, followed by the East region. The Central region ranks third, while the South region has the smallest customer base throughout the period.

5. Profit Margin by Region Over the Years:

  • The West region typically exhibits the highest profit margin over the years, except in 2023, when it was surpassed by both the South and Central regions.
  • The East region consistently follows the West in profit margin, while the South region tracks just behind the East. The Central region generally lags behind the others, with the exception of 2023, when it briefly outperformed the South.

6. Average Discount by Region Over the Years:

  • The Central region experiences the highest average discount levels, which may explain its ongoing struggles with profitability.
  • The West region, on the other hand, benefits from the lowest discount levels over the years, contributing to its strong profitability.
  • Both the South and East regions benefit from moderate discount levels, striking a balance between sales and profitability.

3.3.4.3.2. State-Level Analysis¶

  • 1) State Analysis over the Entire Period¶
# a) Aggregate metrics at the state level for the entire period

df_state = df.groupby('state').agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'), 
    number_of_customers=('customer_id', 'nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_state['sales_share'] = (df_state['total_sales'] / df_state['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_state['profit_share'] = (df_state['total_profit'] / df_state['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_state['profit_margin'] = (df_state['total_profit'] / df_state['total_sales']) * 100

# Sort the DataFrame by total sales in descending order and reset the index
df_state = df_state.sort_values(by='total_sales', ascending=False, ignore_index=True)

# Display the result
df_state.round(2).head()
state total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 California 7667 457687.63 76381.39 7.28 577 19.92 26.67 16.69
1 New York 4224 310876.27 74038.55 5.53 415 13.53 25.85 23.82
2 Texas 3724 170188.05 -25729.36 37.02 370 7.41 -8.98 -15.12
3 Washington 1883 138641.27 33402.65 6.40 224 6.04 11.66 24.09
4 Pennsylvania 2153 116511.91 -15559.96 32.86 257 5.07 -5.43 -13.35
# b) Visualization: Top-performing States over the Entire Period

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(22, 10))

# Set background color for the entire figure
fig.patch.set_facecolor('lightgrey')
# -----------------------------------------------------------------------------------------------------------------------------
### 1. Top 10 Performing States by Quantity Sold

# Sort by total quantity and select top 10
top_10_state_quantity = df_state.nlargest(10, 'total_quantity_sold')

# Barplot for Quantity Sold
sns.barplot(x='state', y='total_quantity_sold', hue='state', data=top_10_state_quantity, palette='Dark2', ax=axes[0,0], alpha=0.85, legend=False)
axes[0, 0].set_title('Top 10 Performing States by Quantity Sold')
axes[0, 0].set_ylabel('Quantity Sold (Units)')

# -------------------------------------------------------------------------------------------------------------------------------
### 2. Top 10 Performing States by Sales

# Sort by total sales and select top 10
top_10_state_sales = df_state.nlargest(10, 'total_sales')

# Barplot for Sales
sns.barplot(x='state', y='total_sales', hue='state', data=top_10_state_sales,palette='Dark2', ax=axes[0,1], alpha=0.85, legend=False)
axes[0, 1].set_title('Top 10 Performing States by Sales')
axes[0, 1].set_ylabel('Total Sales (USD)')

# ------------------------------------------------------------------------------------------------------------------------------
### 3. Top 10 Performing States by Profit

# Sort by total profit and select top 10
top_10_state_profit =df_state.nlargest(10, 'total_profit')

# Barplot for Profit
sns.barplot(x='state', y='total_profit', data=top_10_state_profit, palette='Dark2',hue='state',ax=axes[1, 0], alpha=0.85, legend=False)
axes[1, 0].set_title('Top 10 Performing States by Profit')
axes[1, 0].set_ylabel('Total Profit (USD)')

# -------------------------------------------------------------------------------------------------------------------------------
### 4. Top 10 Performing States by Profit Margin

# Sort by profit margin and select top 10
top_10_state_margin =df_state.nlargest(10, 'profit_margin')

# Barplot for Profit Margin
sns.barplot(x='state', y='profit_margin', data=top_10_state_margin, palette='Dark2', hue='state', ax=axes[1, 1], alpha=0.85, legend=False)

# Overlay lineplot for Average Discount
sns.lineplot(x='state', y='average_discount', data=top_10_state_margin,color='red', marker='o', ax=axes[1, 1], alpha=0.85, linewidth=2)
axes[1, 1].set_title('Top 10 Performing States by Profit Margin')
axes[1, 1].set_ylabel('Profit Margin (%)')

# ----------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.set_xlabel(None)
    ax.grid(True, linestyle=':', alpha=0.7)
    ax.tick_params(axis='x', rotation=45)
#-------------------------------------------------------------------------------------------------------------------------
# Add Data Labels for Bar Charts
for ax in axes.flat:  
    fmt = '%.2f' if ax==axes[1,1] else '%.0f'
    for container in ax.containers:  
        ax.bar_label(container, fmt=fmt, color='black', fontweight='bold', fontsize=12, label_type='edge')  
#----------------------------------------------------------------------------------------------------------------------------
# Adjust layout 
plt.tight_layout(pad=3)

# Display the plots
plt.show()    
[Figure: 2x2 grid of bar charts — top 10 states by quantity sold, total sales, total profit, and profit margin (with average discount overlay)]

Some Insights:¶

1. Quantity Sold (Top 10 Performing States):

  • California (7,667 Units), New York (4,224), and Texas (3,724) lead in quantity sold, accounting for a significant portion of total sales.
  • Pennsylvania, Washington, Illinois, and Ohio form the mid-tier, with sales between 1,700 - 2,200 units.
  • Florida (1,379), North Carolina (983), and Michigan (946) have the lowest quantity sold among the top 10.

2. Total Sales (Top 10 Performing States):

  • California (USD 457,688) and New York (USD 310,876) dominate total sales, reinforcing their market influence.
  • Texas, Washington, and Pennsylvania contribute moderately, with sales between USD 116,000 - 170,000.
  • The lowest-performing states in sales—Michigan (USD 76,270) and Virginia (USD 70,637)—remain significant but lag behind the top-tier states. A key observation is that some states (e.g., Florida) perform well in total sales despite lower quantity sold, suggesting a higher average selling price.

3. Total Profit (Top 10 Performing States):

  • California (USD 76,381) and New York (USD 74,039) remain the most profitable states, aligning with their high sales figures.
  • Washington (USD 33,403) follows, showing a strong profit-to-sales ratio.
  • Michigan (USD 24,463) and Virginia (USD 18,598) perform relatively well in profit, despite relatively moderate sales figures.

Some states with high sales (Texas, Pennsylvania, Florida, Illinois, Ohio) are absent from the top 10 in profit and even exhibit negative profit, suggesting either higher costs or discounting strategies affecting margins.

4. Profit Margin (Top 10 Performing States):

  • District of Columbia (36.98%), Delaware (36.35%), and Minnesota (36.24%) have the highest profit margins, indicating strong profitability relative to sales.
  • Maine, Arkansas, Indiana, and Georgia also maintain high profit margins, ranging from 33% to 36%.
  • Michigan (32.07%) rounds out the top 10, demonstrating solid profitability despite lower total sales.

Notably, some states with high total profit (such as California, New York, and Washington) are absent from the top profit margin rankings. This suggests that these states achieve strong overall profit through high sales volume rather than high per-unit profitability. Possible reasons include aggressive discounting, higher operational costs, or a lower-margin product mix.
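The volume-versus-margin distinction can be made explicit in code. Below is a minimal sketch that flags states whose profit is volume-driven, using a synthetic stand-in for the aggregated `df_state` table — the rows echo figures quoted above where available, but the table and the median thresholds are purely illustrative:

```python
import pandas as pd

# Synthetic stand-in for the aggregated df_state table (illustrative rows only).
df_state_demo = pd.DataFrame({
    'state': ['California', 'New York', 'District of Columbia', 'Delaware'],
    'total_sales': [457688.0, 310876.0, 2865.0, 27451.0],
    'total_profit': [76381.0, 74039.0, 1059.0, 9977.0],
})
df_state_demo['profit_margin'] = (df_state_demo['total_profit'] / df_state_demo['total_sales']) * 100

# Volume-driven states: high absolute profit but a below-median margin.
median_margin = df_state_demo['profit_margin'].median()
volume_driven = df_state_demo[
    (df_state_demo['total_profit'] > df_state_demo['total_profit'].median())
    & (df_state_demo['profit_margin'] < median_margin)
]
print(volume_driven[['state', 'total_profit', 'profit_margin']])
```

The median is only a convenient cut-off for the sketch; in practice the threshold separating "acceptable" from "low" margin would be a business decision.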

2) State Analysis over the Years¶
# a) Aggregate metrics at the state level for each year

df_state_year = df.groupby(['year', 'state']).agg(
    total_quantity_sold=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate percentage contribution to total sales
df_state_year['sales_share'] = (df_state_year['total_sales'] / df_state_year['total_sales'].sum()) * 100

# Calculate percentage contribution to total profit
df_state_year['profit_share'] = (df_state_year['total_profit'] / df_state_year['total_profit'].sum()) * 100

# Calculate profit margin as a percentage
df_state_year['profit_margin'] = (df_state_year['total_profit'] / df_state_year['total_sales']) * 100

# Display the result 
df_state_year.round(2).head()
year state total_quantity_sold total_sales total_profit average_discount number_of_customers sales_share profit_share profit_margin
0 2021 Alabama 60 6139.09 1250.73 0.00 7 0.27 0.44 20.37
1 2021 Arizona 201 8295.24 -903.20 28.96 24 0.36 -0.32 -10.89
2 2021 Arkansas 82 6302.69 2031.69 0.00 7 0.27 0.71 32.24
3 2021 California 1537 91303.53 12637.95 7.12 181 3.97 4.41 13.84
4 2021 Colorado 114 6502.29 438.19 27.24 12 0.28 0.15 6.74
# Visualization: Top-Performing States Over the Years

# Set Seaborn theme 
sns.set_theme(style='whitegrid', rc={"axes.titlesize": 16, "axes.titleweight": "bold", "axes.labelsize": 14, "axes.labelweight": "bold",
        "legend.fontsize": 11, "xtick.labelsize": 12, "ytick.labelsize": 12})

# Create a 2x2 grid of subplots for different performance metrics
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(22, 18))

# Set background color for better contrast
fig.patch.set_facecolor('lightgrey')

# -----------------------------------------------------------------------------------------------------------------------------
# 1. Top 5 States by Quantity Sold over the Years

# Identify the top 5 states based on total quantity
top_5_state_quantity = df_state.nlargest(5, 'total_quantity_sold')

# Filter yearly data for only the top 5 states
top_5_state_quantity_year = df_state_year[df_state_year['state'].isin(top_5_state_quantity['state'])]

# Pivot data for easier plotting
state_quantity_year_pivot = top_5_state_quantity_year.pivot(index='year', columns='state', values='total_quantity_sold')

# Plot a stacked bar chart to show yearly total quantity trends
state_quantity_year_pivot.plot(kind='bar', stacked=True, ax=axes[0, 0], width=0.8, cmap='tab10')
axes[0, 0].set_title('Top 5 States by Quantity Sold over the Years')
axes[0, 0].set_ylabel('Quantity Sold (Units)')

# -----------------------------------------------------------------------------------------------------------------------------
# 2. Top 5 States by Sales over the Years

# Identify the top 5 states based on total sales
top_5_state_sales = df_state.nlargest(5, 'total_sales')

# Filter yearly data for only the top 5 states
top_5_state_sales_year = df_state_year[df_state_year['state'].isin(top_5_state_sales['state'])]

# Pivot data for easier plotting
state_sales_year_pivot = top_5_state_sales_year.pivot(index='year', columns='state', values='total_sales')

# Plot a stacked bar chart to show yearly total sales trends
state_sales_year_pivot.plot(kind='bar', stacked=True, ax=axes[0, 1], width=0.8, cmap='tab10')
axes[0, 1].set_title('Top 5 States by Sales over the Years')
axes[0, 1].set_ylabel('Total Sales (USD)')

# -----------------------------------------------------------------------------------------------------------------------------
# 3. Top 5 States by Profit over the Years

# Identify the top 5 states based on total profit
top_5_state_profit = df_state.nlargest(5, 'total_profit')

# Filter yearly data for only the top 5 states
top_5_state_profit_year = df_state_year[df_state_year['state'].isin(top_5_state_profit['state'])]

# Pivot data for easier plotting
state_profit_year_pivot = top_5_state_profit_year.pivot(index='year', columns='state', values='total_profit')

# Plot a stacked bar chart to show yearly total profit trends
state_profit_year_pivot.plot(kind='bar', stacked=True, ax=axes[1, 0], width=0.8, cmap='tab10')
axes[1, 0].set_title('Top 5 States by Profit over the Years')
axes[1, 0].set_ylabel('Total Profit (USD)')

# -----------------------------------------------------------------------------------------------------------------------------
# 4. Top 5 States by Profit Margin over the Years

# Identify the top 5 states based on profit margin
top_5_state_margin = df_state.nlargest(5, 'profit_margin')

# Filter yearly data for only the top 5 states
top_5_state_margin_year = df_state_year[df_state_year['state'].isin(top_5_state_margin['state'])]

# Plot a grouped bar chart for profit margin trends across years
sns.barplot(x='year', y='profit_margin', data=top_5_state_margin_year, hue='state', palette='tab10', ax=axes[1, 1], errorbar=None)

# Overlay a line plot showing the overall profit margin trend
sns.lineplot(x='year', y='profit_margin', data=yearly_aggr, ax=axes[1, 1], color='green', label='Overall Profit Margin')

# Set plot title and labels
axes[1, 1].set_title('Top 5 States by Profit Margin over the Years')
axes[1, 1].set_ylabel('Profit Margin (%)')
# -----------------------------------------------------------------------------------------------------------------------------
# Apply Common Customizations to all Subplots
for ax in axes.flat:
    ax.tick_params(axis='x', rotation=45) 
    ax.set_xlabel('') 
    ax.legend(loc='best')

# -----------------------------------------------------------------------------------------------------------------------------
# Add Data Labels to Bars
for ax in axes.flat:
    if ax != axes[1, 1]:  # Exclude the profit margin plot (axes[1, 1])
        for container in ax.containers:
            ax.bar_label(container, fmt='%.0f', color='black', fontweight='bold', fontsize=12, label_type='center')
# -----------------------------------------------------------------------------------------------------------------------------
# Adjust Layout 
plt.tight_layout(pad=3)

# Display the Plots
plt.show()
[Figure: 2x2 grid — top 5 states by quantity sold, sales, profit, and profit margin over the years]

Some Insights:¶

1. Top 5 States by Quantity Sold over the Years:

  • Quantity sold has increased overall, reaching its highest level in 2024. However, some states, such as California, Washington, and Texas, experienced a dip in 2022 before rebounding in subsequent years.
  • California consistently leads in total quantity sold across all years, showing strong year-over-year growth.
  • New York held second place in total quantity sold, except in 2021, when Texas slightly surpassed it.
  • Washington and Pennsylvania have the lowest total quantities among the top 5 states, with moderate growth over the years.

2. Top 5 States by Total Sales over the Years:

  • New York and Texas consistently rank after California, with Washington following closely in most years.
  • Pennsylvania, despite having typically lower total sales, shows the most consistent year-over-year growth.
  • 2024 marks the highest sales figures for all top 5 states, except for Texas with its peak in 2021, indicating strong revenue growth.

3. Top 5 States by Total Profit over the Years:

  • California and New York are the leading profit contributors over the years.
  • Michigan and Virginia, which were not among the top states in quantity or sales, make a notable appearance in profit. In the case of Michigan, a consistent increasing trend is observed.
  • Washington's profit contribution decreased consistently during the first three years but surged significantly in 2024, possibly due to improved margins or an increase in high-value sales.

4. Top 5 States by Profit Margin over the Years:

  • Arkansas, Delaware, Maine, Minnesota, and the District of Columbia rank as the top 5 states in terms of profit margin (%).
  • The District of Columbia shows the highest profit margins in most years, despite not leading in total sales or profit.
  • We can observe that the states with the highest profit margins do not correspond with those generating the most sales, highlighting potential differences in pricing strategy or operational efficiency.

3.3.5. Correlation Analysis of Key Metrics¶

Objective: Examine relationships between key metrics, with a focus on the impact of discounts on profitability.

# ------------------------------------------Note : Correlation Coefficients----------------------------------------------------
# Correlation coefficients measure the strength and direction of the relationship between two variables.
# One of the most commonly used methods is the Pearson correlation coefficient. 
# However, for this method to yield accurate and meaningful results, certain conditions must be satisfied:
# - Linear Relationship: The relationship between the variables should be linear.
# - Normality: The data should be approximately normally distributed.
# - Absence of Significant Outliers: Significant outliers can disproportionately influence the results and lead to misleading conclusions.

# In addition to Pearson, there are non-parametric correlation coefficients that are based on the ranks of the data rather than their actual values. 
# Notable examples include the Spearman and Kendall correlation coefficients. These methods offer several advantages over Pearson's coefficient:
# - Distribution-Free: Non-parametric measures do not assume that the data follows a normal distribution.
# - Handles Nonlinear Relationships: They can capture monotonic relationships, even if they are not strictly linear.
# - Robust to Outliers: Rank-based methods are less sensitive to outliers and can be used effectively with ordinal data.

                           # Project Decisions 
# 1) Choice of Kendall method: 
# We have chosen to use a non-parametric correlation coefficient, specifically the Kendall method, to measure the relationship between our numerical metrics. 
# This approach ensures robust and meaningful insights, even in the presence of non-normal distributions, nonlinearity, or outliers.

# 2) Use of Aggregated data:  
# Instead of using individual transactional data, we opted to aggregate daily data for scatter plots.
# This involves grouping data by order_date and computing key metrics per day, offering several advantages:
     # - Enhanced Clarity: Reduces clutter in scatter plots, making patterns and trends easier to identify.
     # - Improved Interpretability: Correlation coefficients become less sensitive to random noise in individual transactions.
     # - Business Alignment: Matches common industry practices that analyze performance metrics at daily, monthly, or quarterly levels.
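The robustness argument in the notes above can be demonstrated with a toy example (made-up numbers, unrelated to the Superstore data): a perfectly monotonic x-y relationship with one extreme x-value. Pearson is pulled down by the outlier, while the rank-based Kendall coefficient still reports a perfect monotonic association:

```python
import pandas as pd

# Monotonic toy data: y rises with x throughout, but one extreme x-value
# breaks linearity (illustrative values only).
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
y = pd.Series([2, 3, 5, 6, 8, 9, 11, 12, 14, 15])

pearson = x.corr(y, method='pearson')   # dragged down by the outlier
kendall = x.corr(y, method='kendall')   # ranks only: perfect monotonic trend -> 1.0

print(f"Pearson: {pearson:.2f}, Kendall: {kendall:.2f}")
```

This is the behavior that motivated the Kendall choice: with skewed daily sales and occasional extreme transactions, a rank-based measure gives a more stable reading of the underlying trend.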
# a) Aggregate metrics for each day
aggr_daily = df.groupby('order_date').agg(
    total_quantity=('quantity', 'sum'),     
    total_sales=('sales', 'sum'),          
    total_profit=('profit', 'sum'),        
    average_discount=('discount', 'mean'),
    number_of_customers=('customer_id','nunique')
).reset_index()

# Calculate profit margin as a percentage
aggr_daily['profit_margin'] = (aggr_daily['total_profit'] / aggr_daily['total_sales']) * 100

# Display the result 
aggr_daily.round(2).head()
order_date total_quantity total_sales total_profit average_discount number_of_customers profit_margin
0 2021-01-03 2 16.45 5.55 20.0 1 33.75
1 2021-01-04 8 288.06 -65.99 40.0 1 -22.91
2 2021-01-05 3 19.54 4.88 20.0 1 25.00
3 2021-01-06 30 4407.10 1358.05 0.0 3 30.82
4 2021-01-07 10 87.16 -71.96 70.0 1 -82.57
# b) Visualization: Scatter Plots with Regression Lines and Correlation Heatmap

# Set the seaborn theme
sns.set_theme(style='whitegrid',rc={"axes.titlesize": 16,"axes.titleweight": "bold","axes.labelsize": 14,"legend.fontsize": 11,
                                    "xtick.labelsize": 12,"ytick.labelsize": 12,"axes.labelweight": "bold"})
# Set the figure and axes
fig, axes = plt.subplots(3, 2, figsize=(16, 14))
fig.suptitle('Scatter Plots, Regression Lines, and Correlation Analysis', fontsize=16, y=0.95, weight='bold')

# Plot 1: Number of Customers vs Quantity Purchased
sns.regplot(x='number_of_customers', y='total_quantity', data=aggr_daily, ax=axes[0, 0],color='orangered', scatter_kws={'alpha': 0.7}, line_kws={'linewidth': 2})
axes[0, 0].set_title('Number of Customers vs Quantity Purchased')
axes[0, 0].set_xlabel('Number of Customers')
axes[0, 0].set_ylabel('Quantity Purchased (Units)')
axes[0, 0].grid(True, linestyle='--', alpha=0.5)

# Plot 2: Impact of Discount on Quantity Purchased
sns.regplot(x='average_discount', y='total_quantity', data=aggr_daily, ax=axes[0, 1],color='steelblue', scatter_kws={'alpha': 0.7}, line_kws={'linewidth': 2})
axes[0, 1].set_title('Impact of Discount on Quantity Purchased')
axes[0, 1].set_xlabel('Discount (%)')
axes[0, 1].set_ylabel('Quantity Purchased (Units)')
axes[0, 1].grid(True, linestyle='--', alpha=0.5)

# Plot 3: Impact of Discount on Sales
sns.regplot(x='average_discount', y='total_sales', data=aggr_daily, ax=axes[1, 0], color='darkorange', scatter_kws={'alpha': 0.7}, line_kws={'linewidth': 2})
axes[1, 0].set_title('Impact of Discount on Sales')
axes[1, 0].set_xlabel('Discount (%)')
axes[1, 0].set_ylabel('Sales (in USD)')
axes[1, 0].grid(True, linestyle='--', alpha=0.5)

# Plot 4: Impact of Discount on Profit
sns.regplot(x='average_discount', y='total_profit', data=aggr_daily, ax=axes[1, 1],color='mediumpurple', scatter_kws={'alpha': 0.7}, line_kws={'linewidth': 2})
axes[1, 1].set_title('Impact of Discount on Profit')
axes[1, 1].set_xlabel('Discount (%)')
axes[1, 1].set_ylabel('Profit (in USD)')
axes[1, 1].grid(True, linestyle='--', alpha=0.5)

# Plot 5: Impact of Discount on Profit Margin
sns.regplot(x='average_discount', y='profit_margin', data=aggr_daily, ax=axes[2, 0],color='orange', scatter_kws={'alpha': 0.7}, line_kws={'linewidth': 2})
axes[2, 0].set_title('Impact of Discount on Profit Margin')
axes[2, 0].set_xlabel('Discount (%)')
axes[2, 0].set_ylabel('Profit Margin (%)')
axes[2, 0].grid(True, linestyle='--', alpha=0.5)

# Plot 6: Correlation Heatmap
correlation_matrix = aggr_daily[['total_quantity', 'total_sales', 'total_profit', 'profit_margin', 'average_discount', 'number_of_customers']].corr(method='kendall')
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True, ax=axes[2, 1],annot_kws={'fontsize': 10, 'weight': 'bold'}, linewidths=0.5, linecolor='white')
axes[2, 1].set_title('Correlation Analysis of Key Metrics (Kendall)')

# Adjust spacing and layout
plt.tight_layout(pad=3)
plt.show()
[Figure: scatter plots with regression lines — customers vs. quantity, and discount vs. quantity, sales, profit, and profit margin — plus a Kendall correlation heatmap]

Insights:¶

1. Number of Customers vs. Quantity Purchased (Units):

  • The scatter plot shows a clear positive linear relationship between the number of customers and the quantity purchased. As the number of customers increases, so does the quantity purchased, suggesting that higher customer volume tends to lead to more products being bought. This is consistent with the understanding that more customers generally mean more purchases, creating a positive trend.
  • The correlation coefficient is 0.72, indicating a strong positive correlation. This supports the conclusion from the scatter plot, where an increase in customers is strongly associated with an increase in the quantity purchased.

2. Impact of Discount on Quantity Purchased (Units):

The scatter plot reveals a nearly horizontal (very weak) relationship between the discount percentage and the quantity purchased. This indicates that higher discounts do not have a significant effect on the quantity of products purchased, suggesting that discounts may not play a crucial role in driving purchase volume in this case.

Several factors could explain this outcome. For example:

  • Customers may not be highly sensitive to price changes or discounts for these specific products, meaning that discounts don’t strongly influence their purchasing decisions.
  • The products could have relatively stable demand, implying that other factors (such as product necessity or brand loyalty) are more influential in driving purchases.
  • The discount percentages offered might not be large enough to substantially impact consumer behavior, especially if the discounts are perceived as minimal or inconsequential.
  • External factors, such as availability, seasonality, or competitive pricing, could be playing a more significant role in determining purchase behavior.

The correlation coefficient of 0.07 further supports the scatter plot’s observation, indicating a very weak positive correlation. This reinforces the notion that there is no clear trend between discount and quantity purchased, underscoring the limited impact of discounts in this scenario.

3. Impact of Discount on Sales (USD):

  • The scatter plot for the discount percentage vs. sales also shows a nearly horizontal (very weak) relationship. This suggests that higher discounts do not strongly influence sales revenue, which was expected since discounts, while reducing the price, do not lead to a significant increase in the quantity sold.

  • The correlation coefficient of 0.02 indicates a very weak positive correlation, even weaker than the correlation for quantity purchased. This supports the scatter plot observation that discount percentages do not significantly impact total sales, further reinforcing the idea that other factors, such as customer demand or product value, play a stronger role in determining sales performance.

4. Impact of Discount on Profit:

  • The scatter plot for discount percentage versus profit reveals a moderate negative relationship. As the discount percentage increases, profit tends to decrease.
  • The correlation coefficient of -0.27 supports this observation, confirming a downward trend between discount percentage and profit. This indicates that deeper discounts negatively affect profitability.
  • This may be explained by the fact that discounts have limited impact on the quantity purchased, as evidenced earlier, meaning the revenue gains from increased sales volume may not fully compensate for the losses caused by reduced prices.

5. Impact of Discount on Profit Margin:

  • The scatter plot for discount percentage vs. profit margin shows a moderate negative relationship. As the discount percentage increases, the profit margin decreases, confirming that giving bigger discounts directly reduces the profitability of each sale. This relationship is quite clear and linear, suggesting that profit margins are highly sensitive to discount levels.

  • The correlation coefficient is -0.44, which matches the scatter plot's negative trend. This reinforces the interpretation from the scatter plot, showing that larger discounts significantly reduce the profit margin for each sale.
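A correlation coefficient describes the strength of a trend but not its size; the slope of the fitted line quantifies how many margin points are lost per additional discount point — the quantity `regplot` only shows visually. A hedged sketch on synthetic daily aggregates (not the real `aggr_daily`), using `numpy.polyfit`:

```python
import numpy as np

# Synthetic daily aggregates: margin falls roughly 0.5 points per discount
# point, plus noise (illustrative only, not the project's data).
rng = np.random.default_rng(42)
discount = rng.uniform(0, 70, 200)                     # average daily discount, %
margin = 30 - 0.5 * discount + rng.normal(0, 5, 200)   # daily profit margin, %

# Least-squares fit: slope = change in margin per one-point discount increase.
slope, intercept = np.polyfit(discount, margin, 1)
print(f"Each extra discount point shifts margin by about {slope:.2f} points")
```

On the actual `aggr_daily` table, the same two-line fit would turn the observed negative correlation into a concrete "margin cost per discount point" figure for pricing discussions.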

===============================================================================================================================

Section 4: Conclusion and Recommendations¶

This project aimed to leverage data visualization techniques to extract actionable insights from the Superstore dataset. Through comprehensive data wrangling, exploratory analysis, and visualization, key trends and patterns were identified across various business dimensions, including sales performance, customer segmentation, product profitability, and regional distribution.

Key findings include:¶

  • Sales & Profit Trends: Sales exhibited a clear yearly upward trend, with strong seasonal patterns at the quarterly and monthly levels. Q4 consistently saw peak sales, driven by holiday shopping and corporate purchases. These insights highlight opportunities for targeted promotions and optimized inventory planning.

  • Product Performance: Specific categories and subcategories were identified as high-performing in terms of both sales volume and profit margin, while others showed lower profitability due to high discounts or operational costs.

  • Customer Segmentation: Different customer segments exhibited distinct purchasing behaviors, with Corporate clients contributing significantly to revenue.

  • Regional Performance: Variability in sales and profitability across regions highlighted opportunities for targeted marketing strategies.

Recommendations:¶

Based on the insights gained, the following strategic recommendations are proposed:

1. Optimize Inventory for High-Demand Periods:

  • Align stock levels with observed seasonal trends to prevent stockouts and maximize sales potential.
  • Consider demand forecasting models for better inventory management.

2. Improve Pricing and Discount Strategies:

  • Reduce deep discounts on low-margin products to improve overall profitability.
  • Implement dynamic pricing strategies to maximize revenue from high-demand products.

3. Enhance Customer Targeting:

  • Tailor marketing campaigns for different customer segments, leveraging insights into their purchasing patterns.
  • Introduce loyalty programs or exclusive deals to retain high-value customers.

4. Regional Expansion & Strategy Refinement:

  • Invest in regions with high sales potential but lower penetration by refining marketing efforts.
  • Evaluate underperforming regions for possible strategy adjustments.
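As a concrete starting point for the demand-forecasting suggestion in Recommendation 1, even a seasonal-naive baseline — repeat last year's monthly pattern, scaled by recent growth — is worth establishing before investing in heavier models. A sketch on invented monthly unit demand with a Q4 peak (all numbers are illustrative):

```python
import pandas as pd

# Two years of made-up monthly unit demand with a Q4 peak (illustrative only).
idx = pd.period_range('2023-01', '2024-12', freq='M')
demand = pd.Series(
    [100, 95, 110, 105, 115, 120, 118, 125, 130, 150, 170, 200,
     108, 102, 118, 114, 124, 130, 128, 135, 141, 163, 184, 216],
    index=idx,
)

# Seasonal-naive forecast for 2025: repeat 2024's monthly pattern,
# scaled by the 2024-over-2023 growth factor.
growth = demand['2024'].sum() / demand['2023'].sum()
forecast_2025 = demand['2024'] * growth
forecast_2025.index = pd.period_range('2025-01', '2025-12', freq='M')
print(forecast_2025.round(0))
```

Because the Superstore data shows stable Q4 seasonality, such a baseline also provides a fair benchmark: any more sophisticated model should at least beat it.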

Final Thoughts & Next Steps¶

The insights from this analysis provide a data-driven foundation for improving profitability and operational efficiency. By refining pricing strategies, optimizing regional sales efforts, and enhancing customer engagement, the business can achieve sustainable growth.

To deepen insights, future projects could incorporate advanced statistical analyses and predictive modeling to further optimize decision-making.


References¶

Books¶

  • McKinney, W. (2022). Python for data analysis: Data wrangling with pandas, NumPy, and Jupyter (3rd ed.). O'Reilly Media. ISBN: 9781098104030.
  • VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O'Reilly Media. ISBN: 9781491912058.

Documentation¶

  • Pandas Development Team. (2024). pandas: Powerful Python data analysis toolkit (Version 2.2.3). Retrieved from https://pandas.pydata.org

  • Matplotlib Development Team. (2024). Matplotlib documentation (Version 3.10.0). Retrieved from https://matplotlib.org/stable/index.html

  • Waskom, M. L. (2024). Seaborn: Statistical data visualization (Version 0.13.2). Retrieved from https://seaborn.pydata.org

  • NumPy Development Team. (2024). NumPy documentation (Version 2.2.0). Retrieved from https://numpy.org/doc/

Videos¶

  • Alex. (2023, June 6). Exploratory data analysis in pandas | Python pandas tutorials [Video]. YouTube. https://www.youtube.com/watch?v=Liv6eeb1VfE

  • Data School. (n.d.). Matplotlib tutorials [Video playlist]. YouTube. Retrieved from https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y

  • Galli, K. (2019, June 1). Intro to data visualization in Python with Matplotlib (line graph, bar chart, title, labels, size) [Video]. YouTube. https://www.youtube.com/watch?v=DAQNHzOcO5A

  • Galli, K. (2024, June 29). Complete Python pandas data science tutorial! [Video]. YouTube. https://www.youtube.com/watch?v=2uvysYbKdjM

  • Schafer, C. (2020). Pandas tutorials [Video playlist]. YouTube. Retrieved from https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS